The mentor-curated study guide to survive the Coursera Deep Learning Specialization course 4
The first three courses of the Coursera Deep Learning Specialization were bearably tough, but then came course 4. So many great topics and concepts! But countless rounds of stopping videos, taking notes, and rewatching lectures led us, a group of official mentors, to decide that a study guide for learners is worth the effort.
Part I of this study guide trilogy reviews the broad concepts covered in this course. What are Convolutional Neural Networks and how does YOLO actually work? Part II summarizes every single lecture and dives deeper into explaining the top-level concepts. Part III will offer a deeplearning.ai dictionary to help you sort through the jungle of acronyms, technical terms and occasional jokes from grandmaster Ng and will be published once we’ve finished course 5.
Let’s start by breaking down the most interesting concepts of the CNN course one by one.
Convolutional Neural Networks
What is a Convolutional Neural Network?
Convolutional Neural Networks (CNNs) are the premier deep learning model for computer vision. Computer vision has become so good that it currently beats humans at certain tasks, e.g. identifying breeds of cats and dogs, and CNNs play a major part in this success story. If you have a task that involves computer vision, be it recognizing faces or objects, CNNs are the go-to model.
How do CNNs work?
CNNs evaluate their inputs through convolutions: the input is convolved with a filter, as shown in the gif above. These convolutions lead the network to detect edges and other low-level features in earlier layers and more complex features in deeper layers of the network. CNNs are used in combination with pooling layers, and they often have fully connected layers at the end, as you can see in the picture below. To train the CNN, run forward propagation like you would in a vanilla neural network and minimize the loss function through backpropagation.
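The convolution step can be sketched in a few lines of NumPy. This is a minimal illustration only, a plain "valid" convolution with a hand-made vertical-edge filter, not code from the course:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (strictly cross-correlation, as in most DL frameworks)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise multiply the window with the filter and sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# An image with a bright left half and a dark right half
image = np.zeros((6, 6))
image[:, :3] = 10.0

# A classic vertical-edge filter
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

edges = conv2d(image, kernel)
print(edges.shape)  # (4, 4)
```

The output is large exactly where the bright-to-dark transition sits, which is what "the network detects edges in earlier layers" means concretely; a trained CNN learns such filter values itself instead of having them hand-crafted.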
Certain architectures like ResNets or Inception networks exist to speed up training the CNN. Processing vast amounts of images and training the weights takes time because there are so many connections. Luckily, many great CNNs, like VGG trained on the ImageNet dataset, are available as pre-trained models that you can reuse. Andrew Ng’s advice is to use transfer learning with an existing CNN architecture and a pre-trained model to get started with your computer vision task quickly.
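The core idea of transfer learning, reusing pre-trained weights and updating only the new task-specific layers, can be sketched in plain Python. Everything below (the layer names, the toy weights, the dummy gradients) is made up for illustration and is not how any real framework stores its layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: two "pre-trained" layers (frozen) plus a new head for our task
layers = {
    "conv1": {"W": rng.standard_normal((3, 3)), "trainable": False},
    "conv2": {"W": rng.standard_normal((3, 3)), "trainable": False},
    "head":  {"W": rng.standard_normal((3, 3)), "trainable": True},
}

def sgd_step(layers, grads, lr=0.01):
    """Update only trainable layers; frozen pre-trained weights stay fixed."""
    for name, layer in layers.items():
        if layer["trainable"]:
            layer["W"] -= lr * grads[name]

grads = {name: np.ones((3, 3)) for name in layers}
before = {name: layer["W"].copy() for name, layer in layers.items()}
sgd_step(layers, grads)

frozen_unchanged = np.allclose(layers["conv1"]["W"], before["conv1"])
print(frozen_unchanged)  # True
```

Freezing the early layers keeps the general visual features (edges, textures) that were learned on millions of images, while the small trainable head adapts the network to your specific classes.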
Object detection through YOLO
What is YOLO?
YOLO is a multiple-object detection algorithm that also works in real time. Picture self-driving cars needing to identify cars, pedestrians, and traffic lights while driving, or simply annotating a movie. YOLO is so fast because “you only look once”: you run a single forward propagation step and immediately know exactly where an object is in an image and which class it belongs to. As a cherry on top, YOLO can detect multiple objects in an image.
How does YOLO work?
To train a CNN using YOLO, you first place a grid over your training image, e.g. 3×3. Next, you create the output labels for every grid cell: draw a bounding box around each object, assign the object to the grid cell that contains its center, and label the output vectors accordingly. Label as many images as possible. The final layer of your CNN has the grid’s width and height and as many channels as there are elements in a single output vector.
If you want to detect three classes of objects, e.g. sun, moon, or star as in the picture above, your output vector has 8 elements. The final layer is 3×3×8 because we use a 3×3 grid for the input image and the output vector has 8 elements. The first element indicates whether an object exists in the cell or not. Four elements encode the object’s center within the cell and its width and height, and the remaining three elements indicate the class of the object.
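As a sketch, here is how such an 8-element label could be assembled for a 3×3 grid. The helper name and the sun/moon/star class list are just illustrative choices, not code from the assignment:

```python
import numpy as np

CLASSES = ["sun", "moon", "star"]  # assumed toy classes, as in the example above

def make_label(grid_size=3, objects=()):
    """objects: iterable of (row, col, bx, by, bw, bh, class_name).
    bx, by locate the object's center within its grid cell; bw, bh are its
    width and height. Output shape: grid_size x grid_size x 8."""
    y = np.zeros((grid_size, grid_size, 8))
    for row, col, bx, by, bw, bh, cls in objects:
        y[row, col, 0] = 1.0                       # p_c: an object is present
        y[row, col, 1:5] = [bx, by, bw, bh]        # bounding box
        y[row, col, 5 + CLASSES.index(cls)] = 1.0  # one-hot class
    return y

# A "moon" centered in the middle cell of the grid
label = make_label(objects=[(1, 1, 0.5, 0.5, 0.3, 0.3, "moon")])
print(label.shape)  # (3, 3, 8)
print(label[1, 1])  # p_c=1, box=(0.5, 0.5, 0.3, 0.3), one-hot class "moon"
```

Cells that contain no object center keep an all-zero vector, and during training only the p_c element matters for those cells.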
Backpropagation adjusts the weights of the CNN so that it learns to identify the objects. You can use non-max suppression to identify the best bounding box for the object. If you encounter multiple objects overlapping in the same grid cell, you can use anchor boxes to separate these objects. These details are explained more thoroughly in Part II.
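Non-max suppression itself is simple enough to sketch in NumPy. This is an illustrative implementation, not the one from the course: keep the highest-scoring box, discard every remaining box that overlaps it beyond an IoU threshold, and repeat:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the best box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending confidence
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        order = [i for i in order[1:] if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Boxes 0 and 1 are near-duplicates of the same object; box 2 is elsewhere
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The duplicate box 1 is suppressed because its IoU with the higher-scoring box 0 exceeds the threshold, leaving one clean detection per object.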
What is Face Recognition?
Face recognition is used to identify a person based on an image of their face. While face verification checks whether a person is who they claim to be based on their face, face recognition is much more complex because you are trying to match the person’s face against a whole database of face images. Additionally, you often have to identify a person through one-shot learning, meaning you have to identify them based on a single image and check whether it is similar enough to any image in the database, which is pretty tough!
How does Face Recognition work?
Your goal is to learn a similarity function, typically by training the network with the triplet loss. The similarity function detects whether the people in two different images are identical. The triplet loss requires three images: an anchor, a positive example (another image of the same person) and a negative example (an image of a different person). Training adjusts the weights so that, relative to the anchor, the distance to the negative image is maximized and the distance to the positive image is minimized. Based on the learned similarity, the CNN decides whether it recognizes the person or not. Make sure you train on hard triplets, where the negative looks similar to the anchor, so the network learns an effective similarity function.
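The triplet loss for a single triplet can be sketched in NumPy. The embedding vectors below are made-up toy values rather than real face encodings, and the margin of 0.2 is just an illustrative choice:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + margin, 0),
    computed on embedding vectors produced by the network."""
    pos_dist = np.sum((anchor - positive) ** 2)  # anchor vs. same person
    neg_dist = np.sum((anchor - negative) ** 2)  # anchor vs. different person
    return max(pos_dist - neg_dist + margin, 0.0)

anchor   = np.array([0.0, 1.0, 0.0])
positive = np.array([0.1, 0.9, 0.0])   # same person: embedding close to the anchor
negative = np.array([1.0, 0.0, 1.0])   # different person: embedding far away
print(triplet_loss(anchor, positive, negative))  # 0.0
```

The loss is zero here because the positive is already much closer to the anchor than the negative by more than the margin; a hard triplet, where the negative sits close to the anchor, would produce a positive loss and a useful gradient.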
Neural Style Transfer
What is it?
Neural Style Transfer is a fun application and will improve your understanding of CNNs. In its essence, you try to generate a new image which combines the content of one image with the style of another image, say from a popular artist. Do you want to know how Picasso would have painted you? Go ahead and try it out for yourself with Neural Style Transfer!
How does Neural Style Transfer work?
In Neural Style Transfer, you start with a generated image G, which contains random pixel values, as shown below. Next, you define a content image C and a style image S, which you want to combine. Your goal is to adjust the pixel values in G so that G becomes similar to both C and S. To do so, you define the cost functions J(C) and J(S) and try to minimize both.
J(C) makes sure that G looks similar to the content in C. You know that CNNs learn to recognize lower-level features like edges in earlier hidden layers and more complex features like faces in later hidden layers. Pick a hidden layer in the middle of the CNN and run forward propagation with C and G. Next, you compare the activation values for both images and try to minimize the difference between the activation values through backpropagation.
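This content cost can be sketched in NumPy. The activation tensors below are random stand-ins for the activations of the chosen middle layer, and the normalization constant follows the convention used in the course:

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content = sum((a_C - a_G)^2) / (4 * n_H * n_W * n_C),
    where a_C and a_G are the chosen layer's activations for C and G."""
    n_H, n_W, n_C = a_C.shape
    return np.sum((a_C - a_G) ** 2) / (4 * n_H * n_W * n_C)

rng = np.random.default_rng(0)
a_C = rng.standard_normal((4, 4, 3))  # stand-in activations for the content image
a_G = a_C.copy()                      # identical activations -> zero content cost
print(content_cost(a_C, a_G))         # 0.0
```

Gradient descent then nudges G's pixels so that its activations at this layer drift toward those of C, driving this cost down.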
Next, you also have to adjust the style of G to match the style in S. The key to minimizing J(S) is to adjust the correlation between the channel activations in G to match those in S. You do that by calculating the Gram matrices for S and G. The Gram matrix captures the correlation between every possible pair of filters. Again, you choose a layer in the middle of the CNN and run forward propagation for S and G. Minimizing J(S) then reduces the difference between the Gram matrices through backpropagation and makes G look more similar to S.
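The Gram matrix and the per-layer style cost can also be sketched in NumPy. As before, the activations are random stand-ins for a real layer's output, and the normalization is the commonly used one:

```python
import numpy as np

def gram_matrix(a):
    """Unroll activations of shape (n_H, n_W, n_C) into (n_C, n_H*n_W) and
    take A @ A.T: entry (i, j) measures how strongly filters i and j co-activate."""
    n_H, n_W, n_C = a.shape
    A = a.reshape(n_H * n_W, n_C).T
    return A @ A.T

def style_cost_layer(a_S, a_G):
    """Style cost for one layer: normalized squared difference of Gram matrices."""
    n_H, n_W, n_C = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return np.sum((G_S - G_G) ** 2) / (4 * (n_C * n_H * n_W) ** 2)

rng = np.random.default_rng(1)
a_S = rng.standard_normal((4, 4, 3))  # stand-in activations for the style image
print(gram_matrix(a_S).shape)         # (3, 3)
print(style_cost_layer(a_S, a_S))     # 0.0
```

Note that the Gram matrix discards where features occur and keeps only how they co-occur, which is why matching Gram matrices transfers texture and style rather than content.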
The cool thing is that the neural network learns to adjust pixel values and not only weights! It’s a very visual way of investigating and understanding CNNs and I encourage you to create your own Neural Style Transfer images.
Disclaimer: All credit is due to deeplearning.ai. While I’m a mentor, I’m merely summarizing and rephrasing the content to help learners progress.
Part I is a wrap, off to Part II. If you think this post was helpful, don’t forget to show your support with some claps and follow me to read more articles about Deep Learning, Online Courses, Autonomous Cars, and Life. Also check out these posts about the Deep Learning Specialization. Please comment to share your opinion. Cheers!
Source: Deep Learning on Medium