Deep Learning: Build a dog detector and breed classifier using CNN

Source: Deep Learning on Medium


You might think recognizing a dog’s breed in an image is an easy task. For many breeds, you are right! Still, it is not difficult to find pairs of dog breeds with minimal inter-class variation, for instance Curly-Coated Retrievers and American Water Spaniels.

But, how about these two?!

Huh, not that easy!

Over the last decade, it has become much easier to use deep learning techniques with a few lines of Python code to distinguish between dog breeds in images. In this post, I will walk you through how to create a Convolutional Neural Network (CNN) from scratch and how to leverage state-of-the-art image classification models pre-trained on ImageNet. This model can be used as part of a mobile or web app for real-world, user-provided images. Given an image, the model determines whether a dog is present and returns the estimated breed. If the image shows a human, it returns the most closely resembling dog breed. You can find the code in my GitHub repo.

Human detector

I used OpenCV’s implementation of the Haar feature-based cascade classifier to detect human faces. The cascade is a machine learning-based approach trained on many images with positive (with a face) and negative (without any face) labels. The detectMultiScale method finds all the faces in an image and returns their coordinates as a list of rectangles. Don’t forget to convert the image to grayscale before using it; note that OpenCV loads images in BGR order, not RGB.

The following face_detector function returns True if at least one human face is detected in the photo:

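import cv2

# Assumption: the original cascade file is not shown, so OpenCV's bundled frontal-face cascade is used here
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def face_detector(img_path):
    img = cv2.imread(img_path)                    # loaded in BGR order
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # the cascade works on grayscale images
    faces = face_cascade.detectMultiScale(gray)   # one rectangle per detected face
    return len(faces) > 0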

I evaluated face_detector on 100 samples of human and dog images. The detector recognized all the human faces in the human data but didn’t perform well on the dog dataset, where it had about 11% false positives.
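
Percentages like the ones above can be computed with a small helper. This is a minimal sketch, not the evaluation code from the repo: detection_rate, stub_detector, and the file names are hypothetical and only stand in for the real detector and image paths.

```python
def detection_rate(detector, img_paths):
    """Return the percentage of images for which detector(path) is truthy."""
    if not img_paths:
        return 0.0
    hits = sum(1 for path in img_paths if detector(path))
    return 100.0 * hits / len(img_paths)


# Stand-in detector for illustration only: flags paths that contain "face".
def stub_detector(img_path):
    return "face" in img_path


# Placeholder file names, not real data.
human_files = ["face_01.jpg", "face_02.jpg"]
dog_files = ["dog_01.jpg", "dog_face_02.jpg", "dog_03.jpg", "dog_04.jpg"]

human_rate = detection_rate(stub_detector, human_files)  # faces found where expected
dog_rate = detection_rate(stub_detector, dog_files)      # false positives
print(f"faces found in human images: {human_rate:.0f}%")
print(f"faces found in dog images: {dog_rate:.0f}%")
```

Passing the real face_detector and the real lists of image paths instead of the stand-ins would give the detection rate on human images and the false-positive rate on dog images.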

Dog detector

It is time to use another detector that performs better on the dog dataset. I used ResNet50 in Keras with weights pre-trained on ImageNet, a dataset of over 10 million images spanning 1000 labels.

In the following code, path_to_tensor takes the path to an image and returns a 4D tensor ready for ResNet50, and paths_to_tensor does the same for a list of paths. All the pre-trained models in Keras also need additional processing, such as normalization, which is done with preprocess_input. The dog_detector function returns “True” if a dog is detected in the image stored at img_path.

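import numpy as np
from keras.preprocessing import image
from tqdm import tqdm
from keras.applications.resnet50 import ResNet50, preprocess_input

# ResNet50 with ImageNet weights (the model definition is not shown in the original snippet)
ResNet50_model = ResNet50(weights='imagenet')

def path_to_tensor(img_path):
    # Load the image and resize it to 224x224
    img = image.load_img(img_path, target_size=(224, 224))
    # Convert to a 3D array of shape (224, 224, 3)
    x = image.img_to_array(img)
    # Add a batch dimension: (1, 224, 224, 3)
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)

def ResNet50_predict_labels(img_path):
    # Normalize the tensor, then return the index of the most probable ImageNet class
    img = preprocess_input(path_to_tensor(img_path))
    return np.argmax(ResNet50_model.predict(img))

def dog_detector(img_path):
    # ImageNet class indices 151 to 268 (inclusive) are dog breeds
    prediction = ResNet50_predict_labels(img_path)
    return 151 <= prediction <= 268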
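
To make the normalization step more concrete: for ResNet50, Keras’ preprocess_input reorders the channels from RGB to BGR and subtracts the per-channel ImageNet mean, without any scaling. The sketch below reproduces that on plain nested lists. The name caffe_style_preprocess is hypothetical, and the real function works on NumPy arrays rather than lists.

```python
# Per-channel ImageNet means in BGR order, as used by Keras' "caffe" preprocessing mode.
IMAGENET_MEAN_BGR = (103.939, 116.779, 123.68)


def caffe_style_preprocess(image_rgb):
    """image_rgb: rows of pixels, each pixel an (R, G, B) triple in the 0-255 range."""
    processed = []
    for row in image_rgb:
        new_row = []
        for r, g, b in row:
            # Reorder RGB -> BGR, then zero-center each channel.
            new_row.append((b - IMAGENET_MEAN_BGR[0],
                            g - IMAGENET_MEAN_BGR[1],
                            r - IMAGENET_MEAN_BGR[2]))
        processed.append(new_row)
    return processed


tiny_image = [[(255, 0, 0), (0, 0, 0)]]  # one red pixel, one black pixel
result = caffe_style_preprocess(tiny_image)
print(result[0][0])  # the red value ends up in the last position
print(result[0][1])  # a black pixel becomes the negated means
```

This is why feeding raw pixel values straight into the pre-trained network gives poor predictions: the weights were learned on inputs centered this way.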

To assess the dog detector, I checked whether the class predicted by ResNet50 falls into the ImageNet dog breed categories. The dog detector performs well, without any false negatives.

Now that we recognize dogs in images, it is time to predict breeds.

Classify dog breeds

Here I created a 4-layer CNN in Keras with the ReLU activation function. The model starts with an input image of 224×224 pixels and 3 color channels. This input is big but very shallow: just R, G, and B. The network squeezes images by reducing width and height while increasing the depth layer by layer. By adding more filters, the network can learn more significant features in the photos and generalize better.

My first layer produces an output with 16 feature channels, which is used as the input for the next layer. The layer’s filters are a collection of 16 small square kernels, and each one produces an output feature map whose values are weighted sums of the input features under the kernel. The kernel’s weights are learned during training, and what the kernel does is slide across the input feature maps and produce output features. So, the shape of the output features depends on the size of the kernel and of the input features.
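
The sliding weighted sum can be sketched in a few lines of plain Python. This is a simplified illustration with a single input channel, a single kernel, no padding, and a stride of 1. A real Conv2D layer also sums over all input channels, adds a bias, and repeats this for each of its filters. The function name conv2d_valid and the kernel values are made up for the example.

```python
def conv2d_valid(feature_map, kernel):
    """Slide kernel over feature_map (no padding, stride 1) and return the weighted sums."""
    k = len(kernel)
    out_h = len(feature_map) - k + 1
    out_w = len(feature_map[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += feature_map[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


feature_map = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# 2x2 kernel, matching kernel_size=2; in a real network these weights are learned.
kernel = [
    [1, 0],
    [0, -1],
]

output = conv2d_valid(feature_map, kernel)
print(output)  # a 3x3 input and a 2x2 kernel give a 2x2 output
```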

Check this page for a better understanding of how CNNs work.

I think it would be ideal for input and output features to have the same size. So, I decided to use “same” padding, which pads the edges of the image with zeros so the kernel can go off the edge, for all the layers, with a stride of 2. I also used the max-pooling operation to shrink the feature maps while keeping the strongest responses in the image, which also lowers the chance of overfitting. Max pooling takes the maximum of the values in a small window around a location.

After four convolutional layers with max-pooling, followed by two fully connected layers, I trained the classifier. In practice, the convolutional layers extract the image features, and the fully connected layers classify the image based on those features.
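
How padding, stride, and pooling change the feature map size can be checked with a little arithmetic. The sketch below follows the Keras/TensorFlow convention for "same" and "valid" padding and adds a tiny 2×2 max pooling. The helper names are hypothetical, and the pooling function assumes an even height and width.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="same"):
    """Spatial output size of a conv or pooling layer (Keras/TensorFlow convention)."""
    if padding == "same":
        return math.ceil(input_size / stride)
    if padding == "valid":
        return (input_size - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding}")


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a map with even height and width."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = (feature_map[i][j], feature_map[i][j + 1],
                      feature_map[i + 1][j], feature_map[i + 1][j + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


same_stride_1 = conv_output_size(224, kernel_size=2, stride=1, padding="same")
same_stride_2 = conv_output_size(224, kernel_size=2, stride=2, padding="same")
valid_stride_1 = conv_output_size(224, kernel_size=2, stride=1, padding="valid")
pooled = max_pool_2x2([[1, 3, 2, 0],
                       [4, 2, 1, 5],
                       [0, 1, 9, 8],
                       [2, 6, 7, 3]])
print(same_stride_1, same_stride_2, valid_stride_1)
print(pooled)
```

Note that "same" padding keeps the width and height unchanged only with a stride of 1. With a stride of 2, whether in the convolution or in the pooling step, each dimension is halved.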

The model architecture is:

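from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential

# Model architecture: four conv layers with increasing depth (16 -> 128 filters)
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu', input_shape=(224, 224, 3)))
# Assumption: the max-pooling layers described in the text are missing from the original snippet
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=128, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
# Assumption: collapse the feature maps to a vector before the fully connected layers
model.add(GlobalAveragePooling2D())
model.add(Dense(512, activation='relu'))
# Assumption: second fully connected layer with one output per dog breed (133)
model.add(Dense(133, activation='softmax'))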

The model created from scratch does not perform well (12% accuracy) because there is not enough image data to train it. The dataset has 8,351 dog images in total. One potential improvement is data augmentation, which modifies the images by randomly padding, cropping, and rotating them to give the network more data. With appropriate parameters, it also helps the model generalize better without overfitting.

Training a CNN from scratch on a small dataset like this usually leads to underfitting, while adding many layers and tuning parameters often causes overfitting. So, it is time to use transfer learning with pre-trained networks to create a CNN breed classifier, even though these models were not explicitly made for this task. One advantage of these networks is that they are trained on large datasets such as ImageNet, with millions of labeled images, where they reach about 90% accuracy. They can also generalize to images outside ImageNet.

The pre-trained models I implemented are VGG, Inception V3, ResNet, and Xception in Keras. For each one, I removed the original classifier, added a new one that best fits the dog breed prediction task, and then fine-tuned the network. Ideally, we should apply appropriate fine-tuning strategies based on the size of the data, the model architecture, and the purpose of the task. For all the networks I used, I changed the last fully connected layer to output the number of dog breeds, 133, and froze the other pre-trained layers.
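
As a rough illustration of what augmentation does, the sketch below pads a tiny image, crops it back to its original size at a random offset, and flips it half of the time. It is only a toy on nested lists: the function names are hypothetical, a horizontal flip is swapped in for rotation to keep the code dependency-free, and a real pipeline would use an image library instead.

```python
import random


def pad(image, amount, fill=0):
    """Surround a 2D image (list of rows) with `amount` pixels of `fill`."""
    width = len(image[0]) + 2 * amount
    top = [[fill] * width for _ in range(amount)]
    middle = [[fill] * amount + list(row) + [fill] * amount for row in image]
    bottom = [[fill] * width for _ in range(amount)]
    return top + middle + bottom


def random_crop(image, height, width, rng):
    """Cut a height x width window out of the image at a random position."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


def horizontal_flip(image):
    return [list(reversed(row)) for row in image]


def augment(image, rng):
    """Pad by one pixel, crop back to the original size, and flip half of the time."""
    h, w = len(image), len(image[0])
    out = random_crop(pad(image, 1), h, w, rng)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out


rng = random.Random(0)  # seeded so the sketch is repeatable
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
variants = [augment(image, rng) for _ in range(4)]
for variant in variants:
    print(variant)
```

Each variant has the same shape as the original, so the network sees shifted and mirrored versions of the same dog without any new photos being collected.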

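from keras.callbacks import ModelCheckpoint
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Sequential
from keras.optimizers import SGD

# Model architecture: a new classifier on top of the pre-computed Xception features
Xception_model = Sequential()
# Assumption: pooling layer missing from the original snippet; train_Xception is inferred from valid_Xception
Xception_model.add(GlobalAveragePooling2D(input_shape=train_Xception.shape[1:]))
Xception_model.add(Dense(133, activation='softmax'))

checkpointer = ModelCheckpoint(filepath='saved_models/', verbose=0, save_best_only=True)
sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True)

# Compile the model
Xception_model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

# Train the model (the start of this call was lost in the original; train_Xception and callbacks are assumed)
Xception_model.fit(train_Xception, train_targets,
                   validation_data=(valid_Xception, valid_targets),
                   shuffle=True, batch_size=20, epochs=25,
                   callbacks=[checkpointer], verbose=1)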

The Xception model outperforms all the other models with an accuracy of 84%. I think the performance of VGG16 and VGG19 could be improved by applying other fine-tuning techniques, for example training a few top layers related to specific dog features and freezing the others that detect more generic features, because those are already captured in the ImageNet weights. Besides that, Inception-based models slightly outperform VGG and ResNet on ImageNet and are also more computationally efficient.

I selected the Xception network as my model architecture and fine-tuned it again by minimizing the cross-entropy loss function using stochastic gradient descent with a learning rate of 0.001. The accuracy reached close to 86%.
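
For readers new to the loss being minimized here, this is a minimal sketch of what a softmax output layer and the categorical cross-entropy loss compute. It uses three classes instead of 133, and the scores are made-up numbers rather than real model outputs.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)


# Three classes stand in for the 133 breeds; the true class is index 0.
target = [1, 0, 0]
confident = softmax([4.0, 1.0, 0.0])  # strongly favors the true class
unsure = softmax([1.0, 1.0, 1.0])     # spreads probability evenly

loss_confident = categorical_cross_entropy(target, confident)
loss_unsure = categorical_cross_entropy(target, unsure)
print(f"confident: {loss_confident:.3f}, unsure: {loss_unsure:.3f}")
```

The loss is small when the model puts most of its probability on the correct breed and grows as that probability shrinks, which is what gradient descent pushes against during fine-tuning.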

Model Evaluation

Now, it is time to combine the detectors and the Xception_predict_breed model. This code takes a path to an image, first determines whether the image shows a dog, a human, or neither, and then prints the predicted or most resembling breed:

import cv2
import matplotlib.pyplot as plt
import numpy as np

# extract_Xception and dog_names come from the project's helper code (not shown here)
def Xception_predict_breed(img_path):
    # Extract bottleneck features, then return the name of the most probable breed
    bottleneck_feature = extract_Xception(path_to_tensor(img_path))
    predicted_vector = Xception_model.predict(bottleneck_feature)
    return dog_names[np.argmax(predicted_vector)]

def display_img(img_path):
    # OpenCV loads images as BGR; convert to RGB for matplotlib
    img = cv2.imread(img_path)
    cv_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    imgplot = plt.imshow(cv_rgb)
    return imgplot

def breed_identifier(img_path):
    prediction = Xception_predict_breed(img_path)
    if dog_detector(img_path):
        print('picture is a dog')
        return print(f"This dog is a {prediction}\n")

    if face_detector(img_path):
        print('picture is a human')
        return print(f"This person looks like a {prediction}\n")

    return print('The picture is neither dog nor human')

Let’s look at the results: