Source: Deep Learning on Medium
12 Deep Learning Interview Questions You Should Not Miss (Part 1)
These are the questions that I often ask when interviewing candidates for an AI Engineer position. Not every interview needs all of them, because that depends on the experience and the projects the candidate has done before. Through many interviews, especially with students, I have gathered a collection of the 12 most interesting Deep Learning interview questions, which I will share with you in this article. I hope to receive many comments from you. OK, no more rambling, let’s get started.
1. Explain the meaning of Batch Normalization
This is a very good question because it covers most of what a candidate needs to know when working with neural network models. You can answer in different ways, but you need to make the following main ideas clear:
Batch Normalization is an effective technique for training neural network models. Its goal is to normalize the features (the output of each layer after going through the activation) to zero mean and a standard deviation of 1. The opposite situation is a non-zero mean. How does that affect model training?
- First, a non-zero mean means that the data is not distributed around 0; most values are either greater than zero or less than zero. Combined with high variance, the values can become very large or very small. This problem is common when training deep neural networks with many layers. When features are not distributed within a stable range, the optimization of the network is affected. As we all know, optimizing a neural network requires computing derivatives. Assuming a simple layer y = Wx + b, the derivative of y with respect to W is (in the scalar case) dy/dW = x. Thus the value of x directly affects the gradient (of course, gradients in real neural networks are not this simple, but in principle x affects the derivative). Therefore, if x changes erratically, the gradient may be too large or too small, which makes training unstable. Because Batch Normalization keeps x in a stable range, it also lets us use higher learning rates during training.
- Batch Normalization helps keep x from falling into the saturated regions of non-linear activation functions, so activations do not become too high or too low. Weights that would probably never learn without it can now learn normally. It also reduces the dependence on the initial values of the parameters.
- Batch Normalization also acts as a form of regularization that helps reduce overfitting. With Batch Normalization we do not need as much dropout, which makes sense because we no longer have to worry about losing too much information when dropping units from the network. However, it is still advisable to combine both techniques.
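The normalization step described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function name `batch_norm` is made up for this example, `gamma` and `beta` are fixed scalars here (in a real layer they are learned per feature), and only the training-time forward pass is shown, without the running statistics used at inference.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of a mini-batch to zero mean and
    unit variance, then apply a scale (gamma) and a shift (beta)."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased batch variance
        std = math.sqrt(var + eps)  # eps avoids division by zero
        for i in range(n):
            out[i][j] = gamma * (column[i] - mean) / std + beta
    return out

# Two features with non-zero means and very different scales
batch = [[100.0, 0.1], [120.0, 0.3], [140.0, 0.2], [160.0, 0.6]]
normalized = batch_norm(batch)

for j in range(2):
    col = [row[j] for row in normalized]
    mean = sum(col) / len(col)
    var = sum((v - mean) ** 2 for v in col) / len(col)
    print(f"feature {j}: mean={mean:.4f}, var={var:.4f}")
```

After normalization both features have a mean of roughly 0 and a variance of roughly 1, regardless of their original scale, which is what keeps the inputs of the next layer in a stable range.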
2. Explain the concepts of bias and variance and the trade-off between them
What is bias? Bias is the difference between the average prediction of the current model and the actual values we need to predict. A model with high bias pays too little attention to the training data. Such a model is too simple and does not achieve good accuracy on either the training or the test set. This phenomenon is also known as underfitting.
Variance can be understood simply as the spread of the model's predictions for a given data point (for example, across models trained on different samples of the data). The larger the variance, the more likely it is that the model pays too much attention to the training data and does not generalize to data it has never seen. As a result, the model achieves extremely good results on the training set but very poor results on the test set. This is the phenomenon of overfitting.
The relationship between these two concepts can be visualized in the following figure:
In the diagram above, the centre of the circle represents a model that predicts the exact values perfectly. In practice, you will never find such a good model. The farther we move from the centre of the circle, the worse our predictions become.
We can change the model so that as many of its predictions as possible fall into the centre of the circle. To do that, we need a balance between bias and variance. If our model is too simple and has very few parameters, it may have high bias and low variance.
On the other hand, if our model has a large number of parameters, it tends to have high variance and low bias. This trade-off is the basis for choosing the complexity of the model when designing the algorithm.
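The trade-off can be observed numerically. The sketch below is an illustration of my own, not part of the original interview answer: it assumes a synthetic target `sin(2πx)` with Gaussian noise, and it uses a constant predictor as the "too simple" model and a 1-nearest-neighbour predictor as a stand-in for a very flexible model. All function names are hypothetical.

```python
import math
import random

def true_fn(x):
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=30, noise=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_fn(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    # Very simple model: always predict the mean of the training targets
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    # Very flexible model: copy the target of the nearest training point
    def predict(x):
        i = min(range(len(xs)), key=lambda k: abs(xs[k] - x))
        return ys[i]
    return predict

def bias_variance(fit, trials=200, seed=0):
    """Estimate squared bias and variance, averaged over fixed test points,
    by retraining the model on many independently sampled training sets."""
    rng = random.Random(seed)
    test_xs = [k / 20 for k in range(1, 20)]
    preds = [[] for _ in test_xs]
    for _ in range(trials):
        model = fit(*sample_training_set(rng))
        for j, x in enumerate(test_xs):
            preds[j].append(model(x))
    bias_sq, variance = 0.0, 0.0
    for x, p in zip(test_xs, preds):
        mean_p = sum(p) / len(p)
        bias_sq += (mean_p - true_fn(x)) ** 2
        variance += sum((v - mean_p) ** 2 for v in p) / len(p)
    return bias_sq / len(test_xs), variance / len(test_xs)

for name, fit in [("constant", fit_constant), ("1-NN", fit_1nn)]:
    b, v = bias_variance(fit)
    print(f"{name}: bias^2={b:.3f}, variance={v:.3f}")
```

Under these assumptions the constant model should show the larger squared bias and the 1-nearest-neighbour model the larger variance, which mirrors the underfitting and overfitting cases described above.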
3. Suppose a Deep Learning model has produced 10 million face vectors. How do you find the closest match to a new face as fast as possible?
This question is about applying Deep Learning algorithms in practice, and its key point is how to index the data. This is the final step in applying One Shot Learning to face recognition, but it is the most important step for making the application easy to deploy in practice.
Basically, for this question you should first give an overview of face recognition with One Shot Learning. It can be understood simply as turning each face into a vector; recognizing a new face then means finding the vectors that are closest (most similar) to the input face. Usually, people use a deep learning model with a custom loss function called triplet loss to do that.
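As a rough illustration of the triplet loss mentioned above, the sketch below computes it for a single (anchor, positive, negative) triplet of embedding vectors. It assumes the common formulation with squared Euclidean distances and a margin; the margin value of 0.2 and the function names are assumptions made for this example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (in squared distance)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

anchor = [0.0, 0.0]      # embedding of a face
positive = [0.1, 0.0]    # another photo of the same person
easy_negative = [1.0, 0.0]   # a different person, already far away
hard_negative = [0.2, 0.0]   # a different person, still too close

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy triplet contributes no loss, while the hard one does, so training pushes the embeddings of different people apart and pulls the embeddings of the same person together.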
However, with the number of images stated in the question, computing the distance to all 10 million vectors for every identification is not a smart solution and makes the system much slower. We need to think about methods for indexing data in a real vector space to make queries more efficient.
The main idea of these methods is to partition the data into structures that are easy to query (for example, something similar to a tree). When new data arrives, querying the tree quickly finds the vector with the closest distance.
Several methods can be used for this purpose, such as Locality Sensitive Hashing (LSH), Approximate Nearest Neighbors Oh Yeah (Annoy), Faiss…
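To give a feel for how such an index avoids scanning every vector, here is a toy random-hyperplane LSH written in plain Python. It is a teaching sketch under strong simplifications, not how Annoy or Faiss are implemented: the class name `RandomHyperplaneLSH` is hypothetical, it uses a single hash table, and the data is random rather than real face embeddings. Production systems typically use several tables or trees to improve recall.

```python
import random
from collections import defaultdict

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

class RandomHyperplaneLSH:
    """Vectors on the same side of every random hyperplane share a bucket,
    so a query is compared only against the vectors in its own bucket."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def _hash(self, vec):
        # One bit per hyperplane: the sign of the dot product
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0
                     for plane in self.planes)

    def add(self, key, vec):
        self.buckets[self._hash(vec)].append((key, vec))

    def candidates(self, vec):
        return self.buckets.get(self._hash(vec), [])

    def query(self, vec):
        # Exact distance is computed only inside the matching bucket
        cands = self.candidates(vec)
        if not cands:
            return None
        return min(cands, key=lambda item: sq_dist(item[1], vec))[0]

rng = random.Random(1)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(2000)]
index = RandomHyperplaneLSH(dim=16)
for i, v in enumerate(vectors):
    index.add(i, v)

print("match:", index.query(vectors[42]))
print("vectors compared:", len(index.candidates(vectors[42])), "of", len(vectors))
```

The query touches only one bucket instead of all stored vectors. The price is that the result is approximate: a true nearest neighbour that falls on the other side of a hyperplane is missed, which is the trade-off all of the methods listed above manage in their own way.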
4. For a classification problem, is accuracy completely reliable? Which metrics do you usually use to evaluate your model?
For a classification problem, there are many different ways to evaluate a model. Accuracy is simply the number of correctly predicted data points divided by the total number of data points. This sounds reasonable, but in reality, for imbalanced data, this quantity is not informative enough. Suppose we are building a model to predict network attacks (assuming attack requests account for about 1/100000 of all requests).
If the model predicts that all requests are normal, the accuracy is still 99.999%, so this figure is often unreliable for classification models. Accuracy only tells us what percentage of the data is predicted correctly; it does not show how each class is classified in detail. Instead, we can use the confusion matrix. Basically, the confusion matrix shows how many data points that actually belong to one class are predicted to fall into each class. It has the following form:
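To make the attack-detection example concrete, the sketch below builds the four confusion-matrix counts for a model that labels every request as normal. The data is synthetic (one attack in 100,000 requests, as assumed in the text), the helper name `confusion_counts` is made up for this example, and recall is included as one commonly used metric derived from the matrix.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, where 1 = attack."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# One attack among 100,000 requests; the model predicts "normal" for all
y_true = [1] + [0] * 99_999
y_pred = [0] * 100_000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0  # share of attacks detected

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.5f}, recall={recall:.1f}")
```

The accuracy looks excellent, yet the single attack is a false negative and recall is zero, which the accuracy figure alone hides.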
In addition, to show how the True Positive and False Positive rates change with each threshold that defines the classification, we have a graph called the Receiver Operating Characteristic (ROC). Based on the ROC we can tell whether the model is effective or not.
Ideally, the closer the orange line is to the top-left corner (i.e., True Positive is high and False Positive is low), the better.
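The points of an ROC curve can be computed by sweeping the decision threshold over the model's scores. The sketch below does this in plain Python on a tiny hand-made example; the labels, scores, and function names are illustrative assumptions, and the area under the curve (AUC) is added as a common single-number summary of the curve.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, starting at (0, 0).
    A sample is predicted positive when its score >= threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"AUC={auc(points):.3f}")
```

A curve that hugs the top-left corner gives an AUC close to 1, while a model that guesses at random stays near the diagonal with an AUC of about 0.5.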
to be continued…