Multi-Perspective Neural Networks — An unsupervised deep metric learning algorithm

Original article was published by Tekin Evrim Ozmermer on Deep Learning on Medium

An unsupervised deep metric learning algorithm that learns to generate embeddings.

Links: Demonstration video & Github repository

Imagine that you are a baby who has just opened their eyes to the world. Imagine the lack of consciousness, imagine not knowing anything. You don’t even know how to see something, let alone how to recognize it. Now, come back to your present consciousness. You can see, you can recognize any object around you, and those objects even bring back memories.

Photo by Christian Bowen on Unsplash

But how did you acquire all these skills? This question is the starting point of Multi-Perspective Neural Networks (MPNN).

In my previous article, I talked about an unsupervised learning algorithm that learns filters which produce meaningful features as output. The link to the post: Learning Filters with Unsupervised Learning. This article is a continuation of that one. It has three sections: logic and philosophy, code, and test.

Logic and Philosophy

Photo by sergio souza on Unsplash

The algorithm learns to decompose the outputs of a convolution layer from each other, that is, to make them dissimilar. When we apply a convolution operation to an image, we get an NxRxC tensor, where N is the number of filters and R and C are the row and column sizes. In MPNNs, the N feature maps in this output are trained to be decomposed from each other. When the maximum pairwise similarity between them falls below a predefined threshold, the learning process for that level is complete. The process is then repeated for the next levels. In the end, the neural network generates an embedding that has meaningful features.
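To make the stopping rule concrete, here is a minimal plain-Python sketch of the check described above: take the flattened feature maps of one level, compute the cosine similarity of every pair, and compare the largest absolute value with the threshold. The function names and the three tiny feature maps are hypothetical and chosen only for illustration; the threshold of 0.3 is the value used in the training script later in this article.

```python
import itertools
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_pairwise_similarity(channels):
    """Largest absolute cosine similarity over all pairs of flattened channels."""
    return max(
        abs(cosine_similarity(a, b))
        for a, b in itertools.combinations(channels, 2)
    )


# Three hypothetical 2x2 feature maps, flattened row by row.
channels = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
threshold = 0.3
worst = max_pairwise_similarity(channels)
level_done = worst < threshold  # False here: the third map overlaps the other two
print(round(worst, 4), level_done)
```

In this toy case the first two maps are already orthogonal, but the third one still overlaps both, so the level would keep training.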

I started thinking about the philosophy of this operation, and about which concept matches this kind of layer decomposition. The concept I was looking for was Perspective. Human beings try to learn to look at events, objects, and everything else from different perspectives so that they can analyze what is going on. This happens not only in high-level abstraction but also in low-level learning. What MPNN tries to do is apply this perspective generation to the low-level learning process in vision.


Code

I shared the code in my previous post, but I have made some changes since then, so I would like to share it again here. The full code can still be found in the GitHub repository.

A single convolution layer class.

import itertools
import cv2
import numpy as np
import torch
from torch import nn, optim
from PIL import Image

class ConvLayer(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride):
        super(ConvLayer, self).__init__()
        self.conv2d = torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride)

    def forward(self, x):
        out = self.conv2d(x)
        return out

A single convolution level class. The level is where the decomposition happens.

class SSNet(torch.nn.Module):
    def __init__(self, in_filters, out_filters):
        super(SSNet, self).__init__()
        self.conv1 = ConvLayer(in_filters, 64, kernel_size=5, stride=1)
        self.conv2 = ConvLayer(64, out_filters, kernel_size=1, stride=1)
        self.pool = nn.AvgPool2d(2, stride=2)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        # 5x5 conv -> ReLU -> 1x1 conv -> 2x2 average pooling
        out = self.pool(self.conv2(self.relu(self.conv1(x))))
        return out

The MPNN class, which stacks several SSNet levels.

class SSNetMultiple(torch.nn.Module):
    def __init__(self, levels=5):
        super(SSNetMultiple, self).__init__()
        # A local list is used instead of `self.children`, because that
        # attribute name would shadow nn.Module.children().
        level_list = []
        for cnt in range(levels):
            if cnt == 0:
                in_filters, out_filters = 3, 16   # first level takes the RGB image
            else:
                in_filters, out_filters = 16, 16
            level_list.append(SSNet(in_filters, out_filters))

        self.main = nn.Sequential(*level_list)

    def forward(self, x, queue=1):
        # Run only the first `queue` levels and return the last output.
        outs = [x]
        for cnt, child in enumerate(self.main):
            if cnt < queue:
                outs.append(child(outs[-1]))
        return outs[-1]
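Because each level applies an unpadded 5x5 convolution followed by 2x2 average pooling, the feature maps shrink quickly from level to level. The following sketch only does the size arithmetic and does not need PyTorch. The helper name and the 224-pixel input are assumptions for illustration; the article does not state the actual input size.

```python
def level_output_sizes(input_size, levels, kernel_size=5, pool=2):
    """Spatial size after each level: unpadded conv, then average pooling."""
    sizes = []
    size = input_size
    for _ in range(levels):
        size = size - kernel_size + 1   # 5x5 convolution, stride 1, no padding
        size = size // pool             # 2x2 average pooling with stride 2
        if size < 1:
            raise ValueError("input is too small for this many levels")
        sizes.append(size)
    return sizes


# Hypothetical 224x224 input pushed through four levels.
sizes = level_output_sizes(224, levels=4)
print(sizes)
```

The 1x1 convolution inside each level does not change the spatial size, so it is left out of the calculation. With 16 output filters, a level whose output is S x S yields a flattened embedding of 16 * S * S values.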

The normalization operation. Scaling each vector to unit length is necessary so that the maximum possible similarity value is 1.

def normalize(vector):
    # Divide by the L2 norm so that the result has unit length.
    norm = vector.norm(p=2, dim=0, keepdim=True)
    vector_normalized = vector.div(norm.expand_as(vector))
    return vector_normalized

The similarity function computes the similarity of every pair of layers (feature maps) and collects the values in one vector, from which we calculate the loss.

def sim_func(layers):
    # layers has shape (1, N, R, C); compare every pair of the N feature maps.
    combinations = list(itertools.combinations(range(layers.shape[1]), 2))
    similarity_vector = torch.empty(len(combinations))
    for cnt, comb in enumerate(combinations):
        first = layers[0][comb[0]].flatten()
        second = layers[0][comb[1]].flatten()
        first_norm = normalize(first)
        second_norm = normalize(second)
        # Dot product of two unit vectors = cosine similarity.
        similarity_vector[cnt] = torch.dot(first_norm, second_norm)
    return similarity_vector

Define the MPNN instance with the number of decomposition levels.

model = SSNetMultiple(levels = 4)

For the dataset, I used MNIST in the previous article. This time, we will use a video downloaded from YouTube.

The video can be found at the link: The Allure of Ibiza, Spain

To capture the frames from the video, I used OpenCV. We need to capture a frame, apply a center crop, resize it, and convert it to a PyTorch tensor.

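def cam_to_tensor(cam):
    # `transform` is assumed to be a torchvision pipeline defined elsewhere
    # (center crop, resize, ToTensor); `video_source` is set in the training script.
    if cam.isOpened():
        ret, frame_ = cam.read()
    else:
        # Reopen the video if the capture is no longer open.
        cam = cv2.VideoCapture(video_source)
        ret, frame_ = cam.read()
    frame = cv2.cvtColor(frame_, cv2.COLOR_BGR2RGB)   # OpenCV reads frames as BGR
    frame_pil = Image.fromarray(frame)
    image = transform(frame_pil)
    return image, frame_, cam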
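The `transform` used above is not shown in this article, so its crop and resize sizes are unknown. As a rough illustration of the center-crop step only, the sketch below computes the box of the largest centered square in a frame. The helper name and the 1280x720 frame size are assumptions. The returned tuple follows the (left, upper, right, lower) order that PIL's `Image.crop` expects.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


# Hypothetical 1280x720 video frame.
box = center_crop_box(1280, 720)
print(box)
```

Cropping to a square before resizing avoids stretching the scene, which matters when the features are later compared across frames.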

Now, the full training script. For the commented-out lines, please see my explanation in the previous post, which is linked at the top of this article.

First, we capture one frame from the video and prepare it to be fed to the model. The model then trains the first level of the decomposition. When the maximum similarity value drops below 0.3, we start training the next level, and so on. Keep in mind that the frames captured from the video are scenes from a city tour.

lr = 0.02
optimizer = optim.SGD(model.parameters(), lr=lr)
lossfunc = nn.MSELoss()
video_source = "./videoplayback.mp4"
cam = cv2.VideoCapture(video_source)
loss_obs = 0
epoch = 0
while epoch < 4:
    # if epoch>0:
    #     for cc,param in enumerate(model.main[epoch-1].parameters()):
    #         print(epoch-1,"grad is deactivated")
    #         param.requires_grad = True
    for cnt in range(0, 120000):
        image, _, cam = cam_to_tensor(cam)

        optimizer.zero_grad()
        out = model(image.unsqueeze(0), queue=epoch + 1)
        sim_vec = sim_func(out)
        # Push every pairwise similarity toward zero.
        loss = lossfunc(sim_vec, torch.zeros_like(sim_vec))
        loss_obs_ = torch.max(torch.abs(sim_vec)).item()
        loss_obs += loss_obs_
        # Standard PyTorch update step, assumed here.
        loss.backward()
        optimizer.step()
        print("Epoch: {}\tSample: {}\tLoss: {}\tLR: {}".format(epoch, cnt, loss_obs_, optimizer.param_groups[0]["lr"]))
        if cnt % 20 == 0 and cnt != 0:
            # Average of the maximum similarity over the last samples.
            loss_obs = loss_obs / 20
            print("Epoch: {}\tSample: {}\tLoss: {}\tLR: {}".format(epoch, cnt, loss_obs, optimizer.param_groups[0]["lr"]))
            if loss_obs < 0.30:
                epoch += 1      # this level is done, move on to the next one
                loss_obs = 0
                break
            loss_obs = 0
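The level-switching rule in the script can be looked at in isolation. The sketch below is a simplified plain-Python version: it averages the per-sample maximum similarity over fixed windows of 20 samples and records the sample index at which each level would be declared finished. The function name and the simulated similarity values are hypothetical, and the windowing is slightly cleaner than the `cnt % 20` check in the script.

```python
def advance_levels(max_similarities, window=20, threshold=0.30, num_levels=4):
    """Return the sample indices at which a level is considered finished."""
    finished_at = []
    running = 0.0
    seen = 0
    for idx, value in enumerate(max_similarities):
        running += value
        seen += 1
        if seen == window:
            # A level is done when the windowed average drops below the threshold.
            if running / window < threshold:
                finished_at.append(idx)
                if len(finished_at) == num_levels:
                    break
            running, seen = 0.0, 0
    return finished_at


# Simulated per-sample maximum similarities (made-up values).
values = [0.9] * 20 + [0.5] * 20 + [0.2] * 20 + [0.6] * 20 + [0.1] * 20
finished = advance_levels(values)
print(finished)
```

In the simulated run, the average jumps back up after the first level finishes, which is what one would expect when a new, untrained level is switched on.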


Test

Our model has learned to see city scenes. How can we observe what it has learned? We can capture random frames and compare each of them with the frames that follow it. By doing so, we can watch the similarity value alongside the frames and see how the content of each scene affects it. For example, suppose the anchor frame contains a specific object such as a person or a window. If that object appears again in the following frames after the scenery has changed, and the similarity value is still fairly high, then we can conclude that the model extracts features that help it recognize that object in the scene.

def generate_embedding(model, cam, queue=3):
    image, frame, _ = cam_to_tensor(cam)
    embedding = model(image.unsqueeze(0), queue=queue).flatten()
    return embedding, frame

def compare_samples(e1, e2):
    first_norm = normalize(e1.flatten())
    second_norm = normalize(e2.flatten())
    return torch.dot(first_norm, second_norm).detach().numpy()

embedding_list = []

def compare_continuous(model, cam, queue):
    # Text overlay settings for cv2.putText. The font is an assumption;
    # any OpenCV font constant works here.
    font = cv2.FONT_HERSHEY_SIMPLEX
    bottomLeftCornerOfText = (10, 100)
    fontScale = 1
    fontColor = (255, 255, 255)
    lineType = 2

    cnt_f = 0
    while True:
        # Pick a new anchor frame every 300 frames.
        if cnt_f % 300 == 0:
            e1, f1 = generate_embedding(model, cam, queue=queue)
            cv2.imshow('frame 1', f1)

        e2, f2 = generate_embedding(model, cam, queue=queue)
        # Assumed: the running list of embeddings is filled here.
        embedding_list.append(e2.detach().numpy())
        embedding_list_np = np.array(embedding_list)
        std = np.std(embedding_list_np, axis=0)
        # Keep the 64 dimensions with the highest standard deviation so far.
        pca_idx = std.argsort()[-64:][::-1]

        e1_pca = e1[pca_idx.tolist()]
        e2_pca = e2[pca_idx.tolist()]

        sim = compare_samples(e1_pca, e2_pca)

        cv2.putText(f2, 'Similarity: {}'.format(sim),
                    bottomLeftCornerOfText, font, fontScale, fontColor, lineType)
        cv2.imshow('frame 2', f2)
        if cv2.waitKey(25) & 0xFF == ord('q'):
            break

        cnt_f += 1
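Despite the name `pca_idx`, the selection step above is not a PCA projection. It simply keeps the embedding dimensions whose values vary the most across the frames seen so far. Here is a minimal plain-Python sketch of that idea, using the population standard deviation (the same default as `np.std`). The function name and the tiny embeddings are hypothetical, and the ordering of tied dimensions may differ from NumPy's `argsort`.

```python
import statistics


def top_variance_indices(embeddings, k):
    """Indices of the k dimensions with the largest standard deviation."""
    dims = len(embeddings[0])
    stds = [
        statistics.pstdev([row[d] for row in embeddings])
        for d in range(dims)
    ]
    # Sort dimension indices from most to least variable.
    order = sorted(range(dims), key=lambda d: stds[d], reverse=True)
    return order[:k]


# Three hypothetical 3-dimensional embeddings, one per frame.
embeddings = [
    [1.0, 0.0, 5.0],
    [1.0, 2.0, 5.0],
    [1.0, 4.0, 6.0],
]
selected = top_variance_indices(embeddings, k=2)
print(selected)
```

Dimension 0 never changes across the three frames, so it carries no information for telling scenes apart and is the one that gets dropped.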

And, the play button.


The demonstration video can be found in the link.