[Week 6 Object Detection and Room Classification with Deep Learning]

Source: Deep Learning on Medium

Team Members: Ahmet Tarık KAYA, Ayça Meriç ÇELİK, Kaan MERSİN

Hello again! Today, we will talk about our progress in the first stage of our project, which is the semantic segmentation of room images. We had previously decided to implement a Fully Convolutional Network as our base model. Our goal is to compare the results of this model with PSPNet, which is a more complex one. This week, we followed a tutorial by James Le that was published on Medium [1].

Image Segmentation with Fully-Convolutional Net:
The Fully-Convolutional Net (FCN) is one of the most popular architectures for semantic segmentation.

[1]The original Fully Convolutional Network (FCN) learns a mapping from pixels to pixels, without extracting region proposals. The FCN pipeline is an extension of the classical CNN. The main idea is to make the classical CNN accept arbitrary-sized images as input. The restriction of CNNs to accept and produce labels only for inputs of a specific size comes from the fully-connected layers, whose dimensions are fixed. In contrast, FCNs only have convolutional and pooling layers, which gives them the ability to make predictions on arbitrary-sized inputs.

Image 1. Structure of FCN
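To make the arbitrary-input-size point concrete, here is a minimal NumPy sketch (our own illustration, not the tutorial's TensorFlow code): a 1×1 convolution is just a per-pixel linear map over the channel axis, so the same weights apply to feature maps of any spatial size.

```python
import numpy as np

def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution: a per-pixel linear map over channels.

    feature_map: (H, W, C_in) array; weights: (C_in, C_out); bias: (C_out,).
    Works for any H and W, unlike a fully connected layer whose input
    size is fixed by its weight matrix.
    """
    return feature_map @ weights + bias

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 21))   # e.g. 512 channels -> 21 classes
b = np.zeros(21)

small = rng.normal(size=(7, 7, 512))
large = rng.normal(size=(16, 24, 512))

print(conv1x1(small, w, b).shape)  # (7, 7, 21)
print(conv1x1(large, w, b).shape)  # (16, 24, 21)
```

The same pair of weights works on both inputs; only the spatial dimensions of the output change, which is exactly why an FCN can segment images of any size.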

The model in the tutorial uses the TensorFlow library with Python 3, along with other dependencies such as NumPy and SciPy. Let’s see how it works!

[1]Here are the key features of the FCN architecture:

FCN transfers knowledge from VGG16 to perform semantic segmentation.
The fully connected layers of VGG16 are converted to fully convolutional layers using 1×1 convolutions. This process produces a class-presence heat map in low resolution.
The upsampling of these low-resolution semantic feature maps is done using transposed convolutions (initialized with bilinear interpolation filters).
At each stage, the upsampling process is further refined by adding features from higher-resolution (but semantically coarser) feature maps from lower layers in VGG16.
A skip connection is introduced after each convolution block to enable the subsequent block to extract more abstract, class-salient features from the previously pooled features.
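The bilinear initialization mentioned above can be sketched in NumPy (a common construction for FCN-style upsampling layers, not code taken from the tutorial): each channel of the transposed convolution is initialized with a 2D bilinear kernel, so that before any training the layer performs plain bilinear upsampling.

```python
import numpy as np

def bilinear_kernel(factor):
    """2D kernel that makes a transposed convolution do bilinear upsampling.

    Kernel size is 2*factor - factor%2 (e.g. 4x4 for 2x upsampling,
    16x16 for 8x), matching the kernel/stride pairs used in FCN-8.
    """
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    # Outer product of two 1D triangular (bilinear) filters.
    return (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)

k = bilinear_kernel(2)   # 4x4 kernel for 2x upsampling
print(k)
```

The kernel is symmetric and its weights sum to factor², so each input pixel's value is spread evenly over the upsampled output.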
There are 3 versions of FCN (FCN-32, FCN-16, FCN-8). We’ll implement FCN-8, as detailed step-by-step below:

Encoder: A pre-trained VGG16 is used as an encoder. The decoder starts from Layer 7 of VGG16.
FCN Layer-8: The last fully connected layer of VGG16 is replaced by a 1×1 convolution.
FCN Layer-9: FCN Layer-8 is upsampled 2× to match the dimensions of Layer 4 of VGG16, using a transposed convolution with parameters (kernel=(4,4), stride=(2,2), padding=’same’). A skip connection is then added between Layer 4 of VGG16 and FCN Layer-9.
FCN Layer-10: FCN Layer-9 is upsampled 2× to match the dimensions of Layer 3 of VGG16, using a transposed convolution with parameters (kernel=(4,4), stride=(2,2), padding=’same’). A skip connection is then added between Layer 3 of VGG16 and FCN Layer-10.
FCN Layer-11: FCN Layer-10 is upsampled 8× to match the input image dimensions, so we get the original spatial size back with depth equal to the number of classes, using a transposed convolution with parameters (kernel=(16,16), stride=(8,8), padding=’same’).
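As a sanity check on the decoder steps above, this small Python sketch traces the spatial dimensions through FCN-8 (the 160×576 input size is just a hypothetical example; any size divisible by 32 works). With padding=’same’, a transposed convolution with stride s multiplies each spatial dimension by s.

```python
def tconv_same_out(h, w, stride):
    """Output spatial size of a transposed conv with padding='same'."""
    return h * stride[0], w * stride[1]

H, W = 160, 576            # hypothetical input size, divisible by 32
l7 = (H // 32, W // 32)    # VGG16 Layer 7 output: 1/32 resolution
l4 = (H // 16, W // 16)    # Layer 4 skip: 1/16 resolution
l3 = (H // 8,  W // 8)     # Layer 3 skip: 1/8 resolution

fcn9 = tconv_same_out(*l7, (2, 2))
assert fcn9 == l4          # must match Layer 4 for the skip connection
fcn10 = tconv_same_out(*fcn9, (2, 2))
assert fcn10 == l3         # must match Layer 3 for the skip connection
fcn11 = tconv_same_out(*fcn10, (8, 8))
assert fcn11 == (H, W)     # back to full input resolution
print(fcn9, fcn10, fcn11)  # (10, 36) (20, 72) (160, 576)
```

The assertions make the skip-connection constraint explicit: each upsampled map must have exactly the spatial size of the VGG16 layer it is added to.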

Our First Problem: GPU supported training
Training an FCN is a costly operation in terms of both time and memory, so using a GPU is highly recommended. Our computers have suitable graphics cards, so we decided to install CUDA and cuDNN to prepare them for the process.

[2]CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing — an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). The CUDA platform is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements, for the execution of compute kernels.

[4]The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.

We were able to install CUDA 9.2 successfully, but the version required for this tutorial was 9.0. Thus, we followed these instructions [3] to install CUDA 9.0 and cuDNN 7.2 on Ubuntu 18.04. It was a tricky job, and we messed up in lots of steps. :) But eventually we succeeded, yay!

Let’s dive into it, but…

Our happiness lasted a pretty short time. We got a ResourceExhaustedError while loading our VGG16 model: we had simply run out of GPU memory during execution. We realized that we did not have enough resources to complete this task locally, so we looked for solutions and decided to use a cloud computing platform. As a result, Amazon Web Services became our fellow. :)
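For anyone hitting the same error with a smaller model, a common first mitigation in TensorFlow 1.x (the version matching the CUDA 9.0 setup above) is to let the session allocate GPU memory incrementally instead of reserving it all at once. This is only a configuration sketch, and it will not help if the model genuinely does not fit in memory, as in our case.

```python
import tensorflow as tf

# TF 1.x session configuration: allocate GPU memory as needed
# rather than grabbing nearly all of it at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory one process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    pass  # load the VGG16 model and run training inside this session
```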

We call it Fellowship of DL :)

Our Solution: Amazon Web Services

We created a t2.micro instance on the free tier plan. The instance supports GPU training with different versions of CUDA. We connected to the machine and began uploading our files. Currently, we are learning how to use the service.

Image 2. Properties of our instance

We will share the segmentation results very soon. Stay tuned for updates, and thank you for following the development of this project!


  1. https://medium.com/nanonets/how-to-do-image-segmentation-using-deep-learning-c673cc5862ef
  2. https://en.wikipedia.org/wiki/CUDA
  3. https://gist.github.com/Mahedi-61/2a2f1579d4271717d421065168ce6a73
  4. https://developer.nvidia.com/cudnn