Original article was published on Deep Learning on Medium
Humans Image Segmentation with Unet using Tensorflow Keras
What we want to do here-
We want to segment humans (only humans for now) in images using existing libraries and resources. We will use the OCHuman dataset and TensorFlow for this, and we will cover both in this post.
We will feed images and their masks to the model, and the model will produce a segmentation mask of the humans in a given image. These masks can be used to artificially blur or replace the background of an image, in self-driving cars, in image editing, for detecting humans in an image, and in many other applications.
The image below shows the result after only 44 epochs of training. There is a lot to discuss in this article.
First of all, let me show you the key points of this article.
- What is image segmentation
- What is Unet and why we use Unet
- Why TensorFlow
- Dataset we use
- How we preprocessed the data and created our custom dataset
- Model Building, Training, and Results with different custom datasets
- What you can try
- Code (GitHub)
1. What is Image Segmentation
Image segmentation is a broad area of machine vision. In image segmentation, we classify every pixel of the image into one of a set of classes.
For example, suppose that in the image below we highlight every pixel that belongs to the cat.
As you can see above, the image is split into two segments: one represents the cat and the other the background.
There are 2 types of image segmentation-
1. Semantic Segmentation: Classification of each pixel into a category.
Example: If there are three cats in the picture, we classify all of them as one class, which is Cat.
2. Instance-aware Segmentation, also known as Simultaneous Detection and Segmentation: Here we identify each individual instance of every object.
Example: If there are three cats in the picture we identify each of them individually.
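The difference between the two types can be made concrete with a toy label map. The sketch below uses NumPy and a made-up 4x8 grid containing three "cats"; the array values are purely illustrative and are not taken from any dataset.

```python
import numpy as np

# Toy 4x8 "image" with three separate cats; 0 is background.
# Instance-aware view: every cat gets its own id (1, 2, 3).
instance_map = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2, 0, 3],
    [0, 0, 0, 0, 2, 2, 0, 3],
    [0, 0, 0, 0, 0, 0, 0, 0],
])

# Semantic view: all cats collapse into a single class (1 = cat).
semantic_map = (instance_map > 0).astype(np.uint8)

print("semantic classes:", np.unique(semantic_map))
print("instance ids:    ", np.unique(instance_map))
```

The semantic map only answers "is this pixel a cat?", while the instance map also answers "which cat?".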
Self-driving cars are one of the biggest applications of image segmentation. In a self-driving car, we may need to classify each object (humans, cars, bikes, road, trees, etc.) individually. There is a lot more to say about self-driving cars; if you want to know more about them, let me know.
2. What is Unet and why we use Unet
First, let’s talk about CNNs. A CNN learns features from images and compresses an image into a feature vector, which we can use for image classification and other tasks.
Now, about Unet: in segmentation, we need to reconstruct an image from the features learned by the CNN. So here we compress the image into feature maps and then reconstruct an image from them.
The architecture looks like a ‘U’. It has two parts, Compression and Expansion. The Compression part consists of blocks of convolution layers followed by max-pooling layers. The number of kernels (feature maps) doubles after each block so that the network can learn complex structures.
Similarly, the Expansion part consists of blocks of convolution layers and upsampling layers. The number of expansion blocks is the same as the number of compression blocks.
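To see how the two halves mirror each other, the feature-map sizes can be traced with a few lines of plain Python. This is only shape bookkeeping, not a trainable model. The function name `unet_shapes` and its defaults (512-pixel input, 16 starting filters, 4 blocks) are assumptions chosen for illustration, not values taken from the training notebook.

```python
def unet_shapes(input_size=512, base_filters=16, depth=4):
    """Return (stage, height/width, channels) tuples for a U-shaped network."""
    shapes = []
    size, filters = input_size, base_filters

    # Compression path: each block is followed by 2x2 max-pooling,
    # which halves the spatial size, while the filter count doubles.
    for level in range(depth):
        shapes.append((f"down{level + 1}", size, filters))
        size //= 2
        filters *= 2

    shapes.append(("bottleneck", size, filters))

    # Expansion path: upsampling doubles the spatial size again,
    # and the filter count is halved block by block.
    for level in range(depth, 0, -1):
        size *= 2
        filters //= 2
        shapes.append((f"up{level}", size, filters))
    return shapes


for stage, size, filters in unet_shapes():
    print(f"{stage:>10}: {size}x{size}x{filters}")
```

With these defaults the output ends at the same 512x512 resolution it started with, which is what lets the network emit a mask the size of the input image.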
How is the loss calculated in image segmentation? It is defined simply in the Unet paper itself:
“The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross-entropy loss function.”
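As a rough illustration of that sentence, the sketch below computes a pixel-wise softmax followed by cross-entropy in NumPy. It leaves out the per-pixel weight map that the paper adds to the loss, and the function name and toy inputs are illustrative, so treat it as a minimal sketch rather than the exact loss used for training in this article.

```python
import numpy as np

def pixelwise_softmax_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.

    logits: float array of shape (H, W, num_classes) with raw scores.
    labels: int array of shape (H, W) with the true class index per pixel.
    """
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # Probability the model assigns to the true class at every pixel.
    true_probs = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    return float(-np.log(true_probs).mean())

# Toy 2x2 image with two classes (0 = background, 1 = human).
labels = np.array([[0, 1], [1, 0]])
uniform_logits = np.zeros((2, 2, 2))             # model has no preference yet
confident_logits = np.eye(2)[labels] * 10.0      # model strongly favours the truth

print(pixelwise_softmax_cross_entropy(uniform_logits, labels))
print(pixelwise_softmax_cross_entropy(confident_logits, labels))
```

The first value equals ln(2), the loss of a two-class model that guesses uniformly; the second is close to zero because the scores agree with the labels.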
We use Unet because it can reconstruct the image. We will feed images as features and their respective mask images as labels to the model. Because of its reconstructive nature, Unet is able to generate images as output as well. Here we are using a supervised learning approach.
3. Why TensorFlow
We use TensorFlow because it allows rapid model development: we can worry less about syntax and focus more on the architecture of the network and on fine-tuning the model.
TensorFlow also provides Keras, so we can use its API to create a data generator, build the model, fine-tune it, and so on very easily.
If you prefer, you can use PyTorch instead. In PyTorch, you need to write more of the code yourself. The advantage of PyTorch is that you can work with tensors more directly and may get slightly better training performance.
4. Dataset we use
The dataset we use is OCHuman. This dataset focuses on heavily occluded humans, with comprehensive annotations including bounding boxes, human poses, and instance masks. It contains 13360 elaborately annotated human instances within 5081 images. With an average MaxIoU of 0.573 per person, OCHuman is described by its authors as the most complex and challenging dataset related to humans.
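MaxIoU is understood here as the largest bounding-box IoU between one person and any other person in the same image, so a higher value means heavier occlusion. The sketch below shows that computation in plain Python. The `(x1, y1, x2, y2)` box format and the example boxes are assumptions made for illustration; check the dataset documentation for the actual format.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def max_iou(index, boxes):
    """Largest IoU between boxes[index] and every other box in the image."""
    others = [b for i, b in enumerate(boxes) if i != index]
    return max((box_iou(boxes[index], b) for b in others), default=0.0)


# Two overlapping people and one person standing apart (made-up boxes).
boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
print([round(max_iou(i, boxes), 3) for i in range(len(boxes))])
```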
You can download the dataset from the OCHuman Dataset link listed at the end of this post.
Sample images from the dataset after applying bounding boxes, human poses, and instance masks-
This dataset contains the following files-
We will only use images.zip and ochuman.json.
images.zip: Contains lots of images without any bounding boxes, human poses, or instance masks. After extracting it, we will have a folder named “images” that contains images like-
ochuman.json: A JSON file that contains the information (bounding boxes, human poses, and instance masks) for the images in the “images” directory.
Another dataset, COCO, is available for the same task, but we don’t want to use it because it also covers other categories besides humans and may need more preprocessing. It is also around 18 GB, whereas OCHuman is only around 700 MB. Once we have some results, we can try the same model or a different model for further training with the COCO dataset as well.
So now you have a basic idea about our dataset, Unet, and the task. Next, let’s understand a little about our custom dataset.
5. How we preprocessed the data and created our custom dataset
Why did we create a custom mask (segmentation)?
I think that for our task the segmentation generated from the dataset is not that useful, so I created a custom segmentation. We want to feed the exact segmentation mask to the model, without any extra or irrelevant information.
You can try a different kind of segmentation by altering the values in the “new_mask” function below.
Now, before proceeding, let me show you the API we use to generate the masks and poses of these images from the JSON file.
You can go to this GitHub link to install the API: https://github.com/liruilong940607/OCHumanApi
Note: Make sure you have downloaded images.zip and extracted it as a folder named “images”, and that you have “ochuman.json”.
How we created mask-
You can write better code than this but for now, this is what I have-
5.1. Install API-
git clone https://github.com/liruilong940607/OCHumanApi
5.2. First import all the required libraries-
5.3. Read ochuman.json file-
We set Filter=‘segm’ because we want only the images that have a segmentation. You can play with different parameters; all the details are mentioned in the API’s GitHub repo.
So now the image_ids list holds 4731 images that contain a segmentation of humans.
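Conceptually, this kind of filter keeps only the images in which at least one person has a mask, which is why 4731 of the 5081 images remain. The sketch below mimics that idea on a tiny in-memory stand-in for ochuman.json using only the standard library. The key names (`images`, `image_id`, `annotations`, `segms`) and all values are assumptions for illustration and do not call the OCHuman API, so check them against the real file before relying on them.

```python
import json

def ids_with_segmentation(dataset):
    """Return the ids of images that have at least one annotated mask."""
    keep = []
    for image in dataset["images"]:
        if any(anno.get("segms") is not None for anno in image["annotations"]):
            keep.append(image["image_id"])
    return keep

# Tiny stand-in for ochuman.json (structure assumed, values made up).
toy = json.loads("""
{"images": [
  {"image_id": "000001",
   "annotations": [{"segms": {"outer": [[0, 0, 4, 0, 4, 4]]}}]},
  {"image_id": "000002",
   "annotations": [{"segms": null}]},
  {"image_id": "000003",
   "annotations": [{"segms": null},
                   {"segms": {"outer": [[1, 1, 3, 1, 3, 3]]}}]}
]}
""")

image_ids = ids_with_segmentation(toy)
print("Total images: %d" % len(image_ids))
```

Note that the third toy image is kept even though only one of its two people has a mask, so an image can pass the filter while some of the humans in it remain unsegmented.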
5.4. This helper function draws only the segmentation on the images-
5.5. Another helper function we created: just pass it an original image and the segmented image generated by the OCHuman API. This function will create a custom black and white mask. You can change the values in the append function to generate different kinds of images. Later, you can feed the generated images to the model.
Read the comments on lines #9 and #11.
new_mask: Use this function if you want a black background and a white human mask, or vice versa.
new_mask_clr: Use this function if you want color images, for example a purple background and a yellow human mask. Read the comments on lines #63, #65, #67 and #70, #72, #74. Just change the values in the append function to change the colors. We use the BGR format because OpenCV reads images in BGR format. The default is a purple background and a yellow mask (humans).
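The idea behind both helpers can be sketched in a vectorized form with NumPy: any pixel that differs between the original image and the API-drawn image is treated as human, everything else as background. This is not the notebook’s code; the function name `mask_from_difference`, the comparison rule, and the exact BGR values for purple and yellow are assumptions for illustration.

```python
import numpy as np

def mask_from_difference(original, segmented,
                         background=(0, 0, 0), human=(255, 255, 255)):
    """Paint pixels the API changed with `human`, the rest with `background`.

    original, segmented: uint8 arrays of shape (H, W, 3) in BGR order.
    """
    changed = np.any(original != segmented, axis=-1)        # (H, W) boolean
    mask = np.where(changed[..., None], human, background)  # (H, W, 3)
    return mask.astype(np.uint8)

# Toy example: pretend the API tinted the centre 2x2 block of a 4x4 image.
original = np.full((4, 4, 3), 100, dtype=np.uint8)
segmented = original.copy()
segmented[1:3, 1:3] = (100, 100, 255)

# Black background, white human.
bw = mask_from_difference(original, segmented)
# Purple background, yellow human (BGR values chosen for illustration).
clr = mask_from_difference(original, segmented,
                           background=(128, 0, 128), human=(0, 255, 255))
print(bw[..., 0])
```

Swapping the `background` and `human` arguments gives the white-background variant. One caveat of this rule is that a drawn pixel whose color happens to equal the original pixel would be missed.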
5.6. Some Parameters-
Altering these parameters may require changing values in many other places in the code, so make sure you understand how the code works first.
5.7. Create black and white segmentation-
Instead of “new_mask” (for a black and white mask) at line #9, you can use the “new_mask_clr” function (for a purple and yellow mask).
The output of this function is: (2, 512, 512, 3) (2, 512, 512, 3)
Explanation- This function returns x and y. Here x is the normal image, with shape (2, 512, 512, 3), without any segmentation, bounding box, etc. And y is the black and white segmented image, with shape (2, 512, 512, 3).
You can see the output here. You may wonder why not all the humans are segmented. It is because of the dataset: in the ochuman.json file, we don’t have a segmentation for the other humans in this image. We have a segmentation for only one human in the image.
5.8. Now generate all the 4731 images-
We will loop through all 4731 images. I know it’s a bit hardcoded, but that is fine for the data generation part.
The output of the above code-
5.9. Put it all together-
All the above code can be found in my GitHub.
Click here for this particular notebook. This notebook is only for the custom data generation part; the training notebook is a separate one. I am using Google Colab, so you may need to edit a few things, such as the directories.
6. Code Model Building-
We will use Unet for training because it is able to regenerate images.
It is possible that the model learns something else instead, for example a color mapping from the input image to the output image. That is why we created three different datasets. We will feed these three datasets to the model one by one, using the same Unet architecture.
With the notebook from point 5, we created three custom datasets-
6.1. Custom Dataset Human: Black Background and White Object-
6.2. Custom Dataset Human: White Background and Black Object-
6.3. Custom Dataset Human: Purple Background and Yellow Object-
We will also talk about data generators and other things, but before that let’s talk about the model and the results. You can play around with different parameters like activation, kernel_initializer, epochs, image_size, etc. We will use the same model for the above three datasets. The architecture we created is shown below-
This output is for the dataset with black background images. You can see that the loss decreases from 0.5708 to 0.3164, and the validation loss decreases from 0.5251 to 0.3122. There is no major change in accuracy.
The Output Image-
You can see that the output is very impressive; by the end of epoch 44 we have the following results. The “epochNumber_x_input.jpg” is the input image, “epochNumber_Y_truth.jpg” is the input mask image (label), and “epochNumber_Y_predicted.jpg” is the image generated (predicted) by the model.
Results 6.1: Images with black background-
Result Analysis: Notice that for the 43rd predicted image (43_Y_predicted.jpg) we have a mask (43_Y_truth.jpg) only for the person on the right. The model is able to segment the person on the right and the girl as well, and to some extent the person on the left with the black hat. After 44 epochs, our Google Colab session crashed.
Results 6.2: Images with white background-
Result Analysis: After 43 epochs, Colab crashed. The results here are very impressive.
Results 6.3: Images with purple background-
Result Analysis: After 43 epochs, Colab crashed again. We achieved the following results, which are very similar to the results with a black or white background.
Let’s talk about Image Generator, Training Parameters, Callbacks, Libraries, and Other things-
The code explains everything. The rest is available on my GitHub.
The following training code is the same for all the notebooks (for the three datasets we created); the only changes are the model name and the directories.
Import OCHuman API-
Read JSON Annotation (labels)-
Parameters and training Images-
Features and Labels-
Train, Valid and Test Split-
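A split like this can be sketched with the standard library alone. The function name `split_ids`, the 80/10/10 ratio, and the seed are assumptions for illustration; the notebook may use different proportions.

```python
import random

def split_ids(image_ids, valid_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle ids reproducibly and cut them into train/valid/test lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    n_valid = int(len(ids) * valid_fraction)
    test = ids[:n_test]
    valid = ids[n_test:n_test + n_valid]
    train = ids[n_test + n_valid:]
    return train, valid, test


# 4731 is the number of images that have a segmentation.
train, valid, test = split_ids(range(4731))
print(len(train), len(valid), len(test))
```

Shuffling before cutting keeps the three subsets from inheriting any ordering in the annotation file, and the fixed seed makes the split repeatable across Colab sessions.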
Custom Keras data generator-
We will call this function during training; it returns the required batch of images.
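The general pattern of such a generator is a Python function that loops forever and yields one (features, labels) batch at a time. The sketch below works on in-memory NumPy arrays. The name `batch_generator`, the scaling to the [0, 1] range, and the small random stand-in images are assumptions for illustration, not the notebook’s “keras_generator_train_val_test” function.

```python
import numpy as np

def batch_generator(features, labels, batch_size=2, shuffle=True, seed=0):
    """Yield (x, y) batches forever, scaled to [0, 1] as float32.

    features, labels: uint8 arrays of shape (N, H, W, C) with matching N.
    """
    rng = np.random.default_rng(seed)
    n = len(features)
    while True:
        # Reshuffle once per pass over the data.
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            x = features[idx].astype(np.float32) / 255.0
            y = labels[idx].astype(np.float32) / 255.0
            yield x, y

# Small random stand-ins for images and masks (real data would be 512x512).
images = np.random.randint(0, 256, size=(6, 64, 64, 3), dtype=np.uint8)
masks = np.random.randint(0, 256, size=(6, 64, 64, 3), dtype=np.uint8)

gen = batch_generator(images, masks, batch_size=2)
x, y = next(gen)
print(x.shape, y.shape)
```

Because the generator never ends, the training loop has to be told how many batches make up one epoch, for example `len(images) // batch_size`.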
Optional, if you want to print images generated by the “keras_generator_train_val_test”-
Custom callbacks to generate intermediate outputs while training-
Get model and print summary-
Let’s put it all together-
The GitHub repo for the above code is here. I use Google Colab for training, so you may need to change the directories to match yours. By the way, all the code (custom dataset generator and training) can also be found below in the “Code (GitHub)” section of this post.
7. What you can try-
7.1. Transfer Learning with Unet
You can also try transfer learning with Unet. Pre-trained networks such as vgg16 or resnet50 are available to use with Unet. Transfer learning will help the compression block of Unet learn faster and learn more. Maybe I’ll talk about this in another article. For that, you may need to use this GitHub repo (Keras Unet pre-trained library).
7.2. Try Grayscale Images
Yes, you can also try grayscale images as your features and labels. The dimension of the input data will reduce to (512, 512, 1) because grayscale images have only one channel, whereas RGB images have three channels, (512, 512, 3). My intuition is that with the color (RGB) dataset the model may learn some color-to-color mapping. That may be a problem, so you can try grayscale. But remember that the same problem may occur with grayscale images, because both the input features and the labels (masks) are grayscale. I don’t know what the model will learn; I haven’t tried it.
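The change in shape is easy to check with NumPy. The sketch below converts a BGR image to a single-channel image using the common luma weights (0.299 R, 0.587 G, 0.114 B). The function name `bgr_to_gray` is illustrative, and in practice an image library’s own conversion routine could be used instead.

```python
import numpy as np

def bgr_to_gray(image):
    """Convert an (H, W, 3) BGR uint8 image to an (H, W, 1) grayscale image."""
    weights = np.array([0.114, 0.587, 0.299])   # B, G, R luma weights
    gray = image.astype(np.float32) @ weights   # weighted sum over channels
    return np.round(gray).astype(np.uint8)[..., None]

# A pure red image in BGR order.
image = np.zeros((512, 512, 3), dtype=np.uint8)
image[..., 2] = 255

gray = bgr_to_gray(image)
print(image.shape, "->", gray.shape)
```

The trailing axis of length 1 is kept on purpose so that the array still has the (height, width, channels) layout the model expects.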
7.3. Some other architecture or model
There are many different kinds of models available. Instead of Unet, you can try R-CNN-based models or FCN, or build on networks such as VGG-16 or ResNet. You can also increase or decrease the number of trainable parameters in Unet or in these other models, for example by adding or removing Compression and Expansion blocks in Unet.
7.4. Using GAN (Generative Adversarial Networks)
Yes, you may use GANs. GANs are again a broad area, so I am not going to talk about them much. You can use an encoder-decoder-like system with a GAN to produce the images you want the model to produce. Remember that GANs need lots of computational power: you may need a high-end GPU, or you would have to keep Colab running for days or weeks, which you can’t. On your own system you can, but you may not have an NVIDIA Tesla K80 GPU at home.
7.5. You can comment or mention here what you have done or created so others can also understand new things.
8. Code (GitHub)
You can find all the code and useful resources in this section. Feel free to use my code; if you can credit my work, that would be appreciated.
You can read more on my Website: www.dipeshpal.com
You can know more about me: www.dipeshpal.in
You can watch my tech videos on YouTube: https://www.youtube.com/DIPESHPAL17
- GitHub Code: https://github.com/Dipeshpal/Image-Segmentation-with-Unet-using-Tensorflow-Keras (You can use this module to run on your system but I’ll recommend you to use Google Colab)
Transfer Learning with Unet: https://divamgupta.com/image-segmentation/2019/06/06/deep-learning-semantic-segmentation-keras.html
GitHub Transfer Learning with Unet: https://github.com/divamgupta/image-segmentation-keras
U-Net: Convolutional Networks for Biomedical Image Segmentation: https://arxiv.org/abs/1505.04597
OCHuman(Occluded Human) Dataset API: https://github.com/liruilong940607/OCHumanApi
OCHuman Dataset: https://cg.cs.tsinghua.edu.cn/dataset/form.html?dataset=ochuman
COCO Dataset: http://cocodataset.org/#home
For any credits, suggestions, changes, or anything please comment here or contact me.
Thank you so much for reading, if you found this helpful please share.