Original article can be found here (source): Deep Learning on Medium
14 | You have 5 GB of RAM in your machine and need to train your model on a 10 GB dataset. How do you address this?
For an SVM, a partial fit would work. The dataset can be split into several smaller chunks that each fit in memory, and the model is updated on one chunk at a time. This works best with a linear SVM trained by stochastic gradient descent, which is computationally cheap; kernel SVMs scale poorly with the number of samples and are harder to train incrementally.
In the case that the data is not suitable for an SVM, a neural network with a small enough batch size could be trained on data kept on disk and loaded one batch at a time, for example from a memory-mapped or compressed NumPy array. NumPy arrays are accepted by common neural network packages like Keras/TensorFlow and PyTorch.
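The chunk-by-chunk idea can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a production recipe: the CSV file is a small synthetic stand-in for the 10 GB dataset, and `iter_chunks`, `partial_fit`, and `predict` are hypothetical helper names. The update rule is a hand-rolled hinge-loss SGD step (a linear SVM without regularization), not a call into any library.

```python
import csv
import os
import random
import tempfile
from itertools import islice


def iter_chunks(path, chunk_size):
    """Lazily yield lists of (features, label) pairs, at most chunk_size rows each."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        while True:
            rows = list(islice(reader, chunk_size))
            if not rows:
                return
            yield [([float(v) for v in row[:-1]], int(row[-1])) for row in rows]


def partial_fit(weights, bias, chunk, lr=0.1):
    """Update a linear model on one chunk with hinge-loss SGD (labels are -1 or +1)."""
    for features, label in chunk:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        if label * score < 1:  # point is misclassified or inside the margin
            weights = [w + lr * label * x for w, x in zip(weights, features)]
            bias += lr * label
    return weights, bias


def predict(weights, bias, features):
    """Return the predicted label (-1 or +1) for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# Assumption: a tiny synthetic file of two well-separated clusters stands in
# for a dataset that would not fit in RAM.
random.seed(0)
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.csv")
with open(path, "w", newline="") as handle:
    writer = csv.writer(handle)
    for _ in range(1000):
        label = random.choice([-1, 1])
        writer.writerow([2 * label + random.uniform(-1, 1),
                         2 * label + random.uniform(-1, 1),
                         label])

weights, bias = [0.0, 0.0], 0.0
for chunk in iter_chunks(path, chunk_size=100):  # only 100 rows in memory at a time
    weights, bias = partial_fit(weights, bias, chunk)

accuracy = sum(
    predict(weights, bias, x) == y
    for chunk in iter_chunks(path, chunk_size=100)
    for x, y in chunk
) / 1000
```

The key design choice is that the model state (`weights`, `bias`) persists across chunks while the data does not, so peak memory depends on the chunk size rather than on the dataset size.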
15 | Deep learning theory has been around for quite a long time, but only recently has it gained much popularity. Why do you think deep learning has surged so much in recent years?
Deep learning development is picking up pace quickly because only recently has it become both necessary and practical. The shift from physical experiences to online ones means that far more data can be collected. Because of this transition, there are more opportunities for deep learning to boost profits and increase customer retention that are not possible in, say, physical grocery stores. It is worth noting that the two biggest deep learning frameworks in Python (TensorFlow and PyTorch) were created by large companies, Google and Facebook. In addition, developments in GPUs mean that models can be trained more quickly.
(Although this question is not strictly theory-related, being able to answer it means you also have your eye on the bigger picture of how your analysis can be used in a corporate sense.)
16 | How would you initialize weights in a neural network?
The most conventional way to initialize weights is randomly, with values close to 0. Then, a properly chosen optimizer can move the weights in the right direction. If the error surface is too rugged, it may be difficult for an optimizer to escape a local minimum. In this case, it may be a good idea to initialize several neural networks, each at a different location on the error surface, so that the chance that one of them finds the global minimum increases.
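As a rough sketch of both ideas (small random starting values, and several independent starting points), assuming a single dense layer with 4 inputs and 3 outputs. The function name `init_weights` and the standard deviation of 0.01 are illustrative choices, not values from the article.

```python
import random


def init_weights(n_in, n_out, scale=0.01, seed=None):
    """Return an n_in x n_out matrix of small random values centered on 0."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]


# Several independent starting points, one per seed. Each would be trained
# separately, and the network with the lowest validation error would be kept.
candidates = [init_weights(4, 3, seed=s) for s in range(5)]
```

Fixing the seed per candidate keeps each restart reproducible while still placing the networks at different starting locations.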
17 | What is the consequence of not setting an accurate learning rate?
If the learning rate is too low, the training of the model will progress very slowly, as the weights receive only minimal updates. However, if the learning rate is set too high, the loss function may jump erratically due to drastic updates in the weights. The model may also fail to converge to a minimum of the error, or may even diverge, particularly when the data is too noisy for the network to train on.
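The three regimes can be seen on a toy problem. The sketch below assumes the simplest possible loss, f(w) = w², whose gradient is 2w; the specific learning rates and the name `gradient_descent` are illustrative only.

```python
def gradient_descent(lr, start=5.0, steps=50):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w  # standard gradient step
    return w


too_low = gradient_descent(lr=0.001)    # barely moves away from the start
reasonable = gradient_descent(lr=0.1)   # ends very close to the minimum at 0
too_high = gradient_descent(lr=1.1)     # overshoots more on every step and diverges
```

Each step multiplies w by (1 - 2·lr), so the iterate shrinks slowly when lr is tiny, shrinks quickly for moderate lr, and grows in magnitude once lr exceeds 1.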
18 | Explain the difference between an epoch, a batch, and an iteration.
- Epoch: Represents one run through the entire dataset (everything put into a training model).
- Batch: Because it is computationally expensive to pass the entire dataset into the neural network at once, the dataset is divided into several batches.
- Iteration: One pass of a single batch through the network, followed by one update of the weights. The number of iterations per epoch equals the number of batches: if we have 50,000 data rows and a batch size of 1,000, then each epoch will run 50 iterations.
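The relationship between the three terms can be checked with a short loop. This is a sketch using the numbers from the example above; the helper names and the choice of 3 epochs are assumptions for illustration.

```python
import math


def iterations_per_epoch(n_rows, batch_size):
    """Number of batches (and therefore weight updates) in one epoch."""
    return math.ceil(n_rows / batch_size)


def batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


n_rows, batch_size, n_epochs = 50_000, 1_000, 3
data = list(range(n_rows))  # placeholder rows

iteration_count = 0
for epoch in range(n_epochs):              # one epoch = one full pass over data
    for batch in batches(data, batch_size):
        iteration_count += 1               # one iteration = one batch processed
```

Using `math.ceil` covers the case where the dataset size is not a multiple of the batch size, in which the last batch is simply smaller.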
19 | What are three primary convolutional neural network layers? How are they commonly put together?
There are typically three primary layers in a convolutional neural network:
- Convolutional layer: A layer that performs a convolution operation, sliding small filters (kernels) over windows of the image to produce feature maps that summarize local patterns.
- Activation layer (usually ReLU): Brings non-linearity to the network and converts all negative values to zero. The output becomes a rectified feature map.
- Pooling Layer: A down-sampling operation that reduces the dimensionality of a feature map.
Usually, a convolutional neural network consists of several repetitions of a convolutional layer, an activation layer, and a pooling layer. These may be followed by one or two additional dense or dropout layers for further generalization, and the network finishes with a fully connected output layer.
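To make the convolution, activation, pooling sequence concrete, here is a minimal pure-Python sketch on a 5x5 image. The image, the 2x2 edge-detecting kernel, and the function names are made up for illustration; real frameworks implement the same operations far more efficiently and learn the kernel values during training.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]


# A bright vertical stripe: dark, dark, bright, bright, dark.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

convolved = conv2d(image, kernel)  # 4x4 map, negative at the bright-to-dark edge
rectified = relu(convolved)        # the negative responses become 0
pooled = max_pool(rectified)       # 2x2 map, half the width and height
```

Stacking this trio several times is what shrinks a large image into a small set of feature maps that the final dense layers can classify.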
20 | What is a dropout layer and how does it help a neural network?
A dropout layer reduces overfitting in a neural network by preventing complex co-adaptations on the training data. A dropout layer acts as a mask, randomly cutting connections to certain nodes. In other words, during training a fraction of the neurons in a dropout layer (about half, with the common rate of 0.5) will be deactivated, forcing each remaining node to carry information that would otherwise have been left to the deactivated neurons. Dropout is sometimes used after max-pooling layers.
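The masking behaviour can be sketched in a few lines. This version uses "inverted" dropout, where the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at inference time; the function name and the default rate of 0.5 are illustrative assumptions.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is switched off at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 1000
dropped = dropout(layer_output, rate=0.5, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)  # close to 0.5
```

A fresh random mask is drawn on every call, so no neuron can rely on any particular other neuron being present.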
21 | On a simplified and fundamental scale, what makes the newly developed BERT model better than traditional NLP models?
Traditional NLP models, to familiarize themselves with the text, are given the task of predicting the next word in a sentence, for example: ‘dog’ in “It’s raining cats and”. Other models are additionally trained to predict the previous word in a sentence, given the context after it. BERT randomly masks words in the sentence and forces the model to predict each masked word using both the context before and after it, for example: ‘raining’ in “It’s _____ cats and dogs.”
This means that BERT is able to pick up on more complex aspects of language that cannot simply be predicted from previous context. BERT has many other features, like various layers of embeddings, but on a fundamental scale, its success comes from how it reads the text.
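The difference in what the model gets to see can be illustrated without any model at all. The sketch below only builds the training example (masked sentence plus target word) and contrasts the context available to a left-to-right model with the context available under masking; the helper names are hypothetical, and whitespace splitting stands in for BERT's real subword tokenizer.

```python
import random


def mask_one_token(tokens, rng, mask_token="[MASK]"):
    """Hide one randomly chosen token and return (masked tokens, position, target)."""
    position = rng.randrange(len(tokens))
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, position, target


def left_context(tokens, position):
    """What a next-word (left-to-right) model sees when predicting tokens[position]."""
    return tokens[:position]


def both_contexts(tokens, position):
    """What a masked-language model sees: the words on both sides of the gap."""
    return tokens[:position], tokens[position + 1:]


tokens = "It's raining cats and dogs".split()
masked, position, target = mask_one_token(tokens, random.Random(0))

# Predicting 'raining' (index 1): one word of context versus four.
only_left = left_context(tokens, 1)       # ["It's"]
left, right = both_contexts(tokens, 1)    # ["It's"] and ["cats", "and", "dogs"]
```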
22 | What is Named-Entity Recognition?
NER, also known as entity identification, entity chunking, or entity extraction, is a subtask of information extraction that tries to locate and classify named entities mentioned in unstructured text into categories such as person names, organizations, locations, monetary values, times, etc. NER attempts to separate words that are spelled the same but mean different things and to correctly identify entities that may have sub-entities in their name, like ‘America’ in ‘Bank of America’.
23 | You are given a large dataset of tweets, and your task is to predict if they are positive or negative sentiment. Explain how you would preprocess the data.
Since tweets are full of hashtags that may carry valuable information, the first step would be to extract hashtags and perhaps create a one-hot encoded set of features, in which the value is ‘1’ for a tweet if it contains a given hashtag and ‘0’ if it doesn’t. The same can be done with @ mentions (whichever account the tweet is directed at may be of importance). Tweets are also compressed writing (since there is a character limit), so there will probably be lots of purposeful misspellings that will need to be corrected. Perhaps the number of misspellings in a tweet would be helpful as well; maybe angry tweets have more misspelled words.
Removing punctuation, albeit standard in NLP preprocessing, should not be the first step in this case, because the use of exclamation marks, question marks, periods, etc. may be valuable when used in conjunction with other data. There may be three or more columns where the value for each row is the number of exclamation marks, question marks, etc. Once these counts are recorded, the punctuation should be removed from the text that is fed into the model.
Then, the data would be lemmatized and tokenized. The model then receives not just the raw text but also knowledge about hashtags, @ mentions, misspellings, and punctuation, all of which will probably improve accuracy.
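Part of this pipeline can be sketched with regular expressions from the standard library. The example tweet, the function names, and the small hashtag vocabulary are invented for illustration; spelling correction and lemmatization are left out because they need an external library or dictionary.

```python
import re


def tweet_features(tweet):
    """Extract side features first, then clean the text for the model."""
    hashtags = [h.lower() for h in re.findall(r"#(\w+)", tweet)]
    mentions = [m.lower() for m in re.findall(r"@(\w+)", tweet)]
    punctuation = {
        "exclamations": tweet.count("!"),
        "questions": tweet.count("?"),
        "periods": tweet.count("."),
    }
    text = re.sub(r"[#@]\w+", " ", tweet)   # drop hashtags and mentions
    text = re.sub(r"[^\w\s']", " ", text)   # drop punctuation, keep apostrophes
    tokens = text.lower().split()
    return {"hashtags": hashtags, "mentions": mentions,
            "punctuation": punctuation, "tokens": tokens}


def one_hot(items, vocabulary):
    """1 if the vocabulary entry occurs in items, else 0, in vocabulary order."""
    present = set(items)
    return [1 if term in present else 0 for term in vocabulary]


example = "@AirlineCo my flight is delayed AGAIN!!! Why?? #fail #travel"
features = tweet_features(example)
hashtag_vector = one_hot(features["hashtags"], ["fail", "love", "travel"])
```

The order matters: punctuation is counted before it is stripped, so the signal survives as numeric columns even though the cleaned tokens no longer contain it.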
24 | How might you find the similarity between two paragraphs of text?
The first step is to convert the paragraphs into a numerical form with some vectorizer of choice, like bag of words or TF-IDF. In this case, bag of words may be better, since the corpus (collection of texts) is not very large (only two documents), so inverse document frequencies carry little information. In addition, it may be more true to the text, since TF-IDF is primarily used to weight features for models. Then, one could use cosine similarity or Euclidean distance to compare the two vectors.
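A minimal sketch of the bag-of-words plus cosine similarity route, using only the standard library. The tokenizer here is a simple regular expression and the function names are illustrative; a real pipeline would likely use a library vectorizer.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = unrelated, 1 = same direction)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


first = bag_of_words("The cat sat")
second = bag_of_words("The cat ran")
similarity = cosine_similarity(first, second)  # two of three words are shared
```

Cosine similarity ignores vector length, so a long paragraph and a short one about the same topic can still score highly, which is one reason it is often preferred over Euclidean distance for text.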
25 | In a corpus of N documents, one randomly chosen document contains a total of T terms. The term ‘hello’ appears K times in that document. What is the correct value for the product of TF (Term Frequency) and IDF (Inverse Document Frequency), if the term ‘hello’ appears in about one third of the total documents?
The formula for Term Frequency is K/T, and the formula for IDF is the logarithm of the total number of documents over the number of documents containing the term, or log of N over N/3, which is log of 3. Therefore, the TF-IDF value for ‘hello’ is K * log(3)/T.
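The same calculation as a small function, with made-up numbers plugged in (K = 3, T = 100, N = 300). The natural logarithm is an assumption here, since the question does not fix a base, and `tf_idf` is an illustrative name.

```python
import math


def tf_idf(k, t, n_docs, docs_with_term):
    """TF-IDF with TF = k / t and IDF = log(n_docs / docs_with_term)."""
    tf = k / t
    idf = math.log(n_docs / docs_with_term)  # natural log (assumed base)
    return tf * idf


# 'hello' appears 3 times in a 100-term document and in 100 of 300 documents.
value = tf_idf(k=3, t=100, n_docs=300, docs_with_term=100)
closed_form = 3 * math.log(3) / 100  # K * log(3) / T
```

Because only the ratio N / (N/3) enters the IDF, the result does not depend on the actual number of documents, only on the one-third share.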
26 | Is there a universal set of stop words? When would you increase the ‘strictness’ of stop words and when would you be more lenient on stop words? (Being lenient on stop words means decreasing the number of stop words eliminated from the text.)
There is no truly universal set of stop words. There are generally accepted lists, such as the one shipped with the NLTK library in Python, but in certain contexts the list should be lengthened or shortened. For example, if given a dataset of tweets, the stop words should be more lenient because each tweet does not have much content to begin with. More information is packed into a small number of characters, so it may be irresponsible to discard what we deem to be stop words. However, if given, say, a thousand short stories, we may want to be harsher on stop words, not only to conserve computing time but also to differentiate more easily between the stories, which will probably all use many stop words several times.
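Adjusting strictness amounts to adding to or removing from a base set. The sketch below uses a tiny hypothetical stop word list (not NLTK's actual list) and invented example sentences to show a lenient variant for tweets and a strict variant for long-form text.

```python
# Hypothetical base list for illustration only.
BASE_STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or",
                   "of", "to", "in", "not", "no"}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the given stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]


# Lenient: keep negations, which carry sentiment in short texts like tweets.
lenient = BASE_STOP_WORDS - {"not", "no"}
# Strict: also drop filler words that recur across long texts like stories.
strict = BASE_STOP_WORDS | {"said", "then", "very"}

tweet = "the service is not good".split()
story = "then the very old man said hello".split()

tweet_base = remove_stop_words(tweet, BASE_STOP_WORDS)  # negation is lost
tweet_lenient = remove_stop_words(tweet, lenient)       # negation is kept
story_strict = remove_stop_words(story, strict)
```

With the base list, "the service is not good" collapses to the same tokens as "the service is good", which is exactly the kind of loss the lenient list avoids.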