Original article was published by The Educative Team on Artificial Intelligence on Medium
Intermediate Questions (15)
These intermediate questions take the basic theories of ML from above and apply them in a more rigorous way.
1. Which cross-validation technique would you choose for a time series dataset?
A time series is not randomly distributed but has a chronological ordering. You want to use something like forward chaining so you can model based on past data before looking at future data. For example:
- Fold 1: training [1], test [2]
- Fold 2: training [1 2], test [3]
- Fold 3: training [1 2 3], test [4]
- Fold 4: training [1 2 3 4], test [5]
- Fold 5: training [1 2 3 4 5], test [6]
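Scikit-learn implements this forward-chaining scheme as `TimeSeriesSplit`. A minimal sketch with six time-ordered samples (the toy data is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)  # 6 time-ordered samples, 2 features each

# Each fold trains on all earlier samples and tests on the next block
tscv = TimeSeriesSplit(n_splits=5)
folds = [(list(train), list(test)) for train, test in tscv.split(X)]
# folds[0] trains on sample 0 and tests on sample 1;
# folds[-1] trains on samples 0-4 and tests on sample 5
```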
2. How do you choose a classifier based on a training set size?
For a small training set, a model with high bias and low variance is better, as it is less likely to overfit. An example is Naive Bayes.
For a large training set, a model with low bias and high variance is better, as it can express more complex relationships. An example is Logistic Regression.
3. Explain the ROC Curve and AUC.
The ROC curve is a graphical representation of the performance of a classification model across all classification thresholds. It plots two parameters: the true positive rate and the false positive rate.
AUC (Area Under the ROC Curve) measures the two-dimensional area underneath the ROC curve from (0,0) to (1,1). It is used as a performance metric for evaluating binary classification models.
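As a sketch, scikit-learn's `roc_curve` and `roc_auc_score` compute both directly; the toy labels and scores below are illustrative:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]            # ground-truth binary labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # model's predicted probabilities

# fpr/tpr pairs, one per threshold, trace out the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# AUC summarizes the curve as a single number in [0, 1]
auc = roc_auc_score(y_true, y_scores)
```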
4. Explain LDA for unsupervised learning.
Latent Dirichlet Allocation (LDA) is a common method for topic modeling. It is a generative model for representing documents as a combination of topics, each with their own probability distribution.
In effect, LDA projects documents from a high-dimensional word space onto a lower-dimensional topic space. This helps to avoid the curse of dimensionality.
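A minimal topic-modeling sketch using scikit-learn's `LatentDirichletAllocation`; the four toy documents and the choice of two topics are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are assets",
    "markets trade stocks",
]

# Convert documents to word counts, then fit a 2-topic LDA model
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each document is represented as a distribution over topics (rows sum to 1)
doc_topics = lda.fit_transform(counts)
```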
5. How do you ensure you are not overfitting a model?
There are three methods we can use to prevent overfitting:
- Use cross-validation techniques (like k-folds cross-validation)
- Keep the model simple (i.e. take in fewer variables) to reduce variance
- Use regularization techniques (like LASSO) that penalize model parameters likely to cause overfitting
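The first and third methods above can be sketched together with scikit-learn; the synthetic dataset (only the first of ten features matters) is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: only feature 0 actually drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)

# k-fold cross-validation: 5 held-out scores instead of one optimistic fit
scores = cross_val_score(model, X, y, cv=5)

# The L1 penalty shrinks irrelevant coefficients toward zero
model.fit(X, y)
```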
6. In SQL, how are primary and foreign keys related?
SQL is one of the most popular tools for working with data in ML, so you need to demonstrate your ability to manipulate SQL databases.
Foreign keys allow you to match and join tables on the primary key of the corresponding table.
If you encounter this question, explain the basic concept, and then explain how you would set up SQL tables and query them.
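A minimal sketch of the concept using Python's built-in `sqlite3` module; the customers/orders schema is a hypothetical example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The primary key uniquely identifies each row in its own table
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# The foreign key references the primary key of the corresponding table
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL)""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")

# Joining matches rows via the foreign-key -> primary-key relationship
rows = conn.execute("""SELECT c.name, o.total
                       FROM orders o
                       JOIN customers c ON o.customer_id = c.id""").fetchall()
```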
7. What evaluation approaches would you use to gauge the effectiveness of an ML model?
First, you would split the dataset into training and test sets. You could also use a cross-validation technique to segment the dataset. Then, you would select and implement performance metrics. For example, you could use the confusion matrix, the F1 score and accuracy.
You’ll want to explain the nuances of how a model is measured based on different parameters. Interviewees that stand out take questions like these one step further.
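The metrics mentioned above can be sketched with scikit-learn; the toy labels and predictions are illustrative:

```python
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

y_true = [1, 0, 1, 1, 0, 1]  # ground truth
y_pred = [1, 0, 0, 1, 0, 1]  # model predictions

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# F1 balances precision and recall; accuracy is the fraction correct
f1 = f1_score(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
```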
8. Explain how to handle missing or corrupted data in a dataset.
You need to identify the missing or corrupted data and either drop those rows/columns or replace them with other values.
Pandas provides useful methods for doing this: isnull() and dropna() allow you to identify and drop missing or corrupted data, and the fillna() method can be used to fill invalid values with placeholders.
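A minimal sketch of these Pandas methods on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "city": ["NY", "LA", None],
})

missing = df.isnull()   # boolean mask marking missing cells
dropped = df.dropna()   # drop any row containing a missing value

# Replace missing values with placeholders instead of dropping them
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})
```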
9. Explain how you would develop a data pipeline.
Data pipelines enable us to take a data science model and automate or scale it. A common data pipeline tool is Apache Airflow, and Google Cloud, Azure, and AWS are used to host them.
For a question like this, you want to explain the required steps and discuss real experience you have building data pipelines.
The basic steps are as follows for a Google Cloud host:
- Sign into Google Cloud Platform
- Create a compute instance
- Pull tutorial contents from GitHub
- Use Airflow for an overview of the pipeline
- Use Docker to set up virtual hosts
- Develop a Docker container
- Open Airflow UI and run the ML pipeline
- Run the deployed web app
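Independent of any particular host, the extract-transform-train flow at the heart of these steps can be sketched as plain functions; in Airflow each would typically become a task (e.g. a PythonOperator). All names and the toy data below are hypothetical:

```python
# A minimal, framework-free sketch of data pipeline stages.

def extract():
    # In practice: pull from a database, API, or object store
    return [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]  # toy (feature, label) rows

def transform(rows):
    # Center the feature around its mean
    xs = [x for x, _ in rows]
    mean = sum(xs) / len(xs)
    return [(x - mean, y) for x, y in rows]

def train(rows):
    # Hypothetical "model": predict 1 when the centered feature is positive
    return lambda x: int(x > 0)

def run_pipeline():
    # Stages chained in order; an orchestrator would schedule and retry these
    return train(transform(extract()))

model = run_pipeline()
```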
10. How do you fix high variance in a model?
If the model has low bias and high variance, we use a bagging algorithm, which divides a data set into subsets using randomized sampling. We use those samples to generate a set of models with a single learning algorithm.
Additionally, we can use the regularization technique, in which higher model coefficients are penalized to lower the complexity overall.
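A bagging sketch using scikit-learn's `BaggingClassifier` over decision trees (a classic high-variance base learner); the synthetic dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each tree is fit on a bootstrap (randomized) sample of the data;
# averaging their votes reduces the ensemble's variance
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0)
bag.fit(X, y)
```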
11. What are hyperparameters? How do they differ from model parameters?
A model parameter is a variable that is internal to the model. The value of a parameter is estimated from training data.
A hyperparameter is a variable that is external to the model. Its value cannot be estimated from the data; it is set before training and governs how the model parameters are learned.
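The distinction can be sketched with scikit-learn's `LogisticRegression`, where `C` is a hyperparameter set before training and `coef_`/`intercept_` are parameters learned from the data; the synthetic dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)

# C is a hyperparameter: chosen by hand (or by tuning) before training
clf = LogisticRegression(C=0.5)
clf.fit(X, y)

# coef_ and intercept_ are model parameters: estimated from the training data
weights, bias = clf.coef_, clf.intercept_
```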
12. You are working on a dataset. How do you select important variables?
- Remove correlated variables before selecting important variables
- Use Random Forest and plot a variable importance chart
- Use Lasso Regression
- Use linear regression to select variables based on p values
- Use Forward Selection, Stepwise Selection, and Backward Selection
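The Random Forest approach above can be sketched with scikit-learn; the synthetic dataset (three informative features among eight) is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; the scores sum to 1, so ranking them
# (or plotting them as a bar chart) surfaces the important variables
importances = forest.feature_importances_
ranked = sorted(enumerate(importances), key=lambda t: -t[1])
```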
13. How do you choose which algorithm to use for a dataset?
Choosing an ML algorithm depends on the type of data in question. Business requirements also factor into choosing an algorithm and building a model, so when answering this question, explain that you would need more information.
For example, if your data is organized in a linear fashion, linear regression would be a good algorithm to use. If the data is made up of non-linear interactions, a bagging or boosting algorithm is best. If you're working with images, a neural network would be best.
14. What are advantages and disadvantages of using neural networks?
Advantages:
- Stores data on the entire network rather than in a database
- Parallel processing
- Distributed memory
- Provides great accuracy even with limited information
Disadvantages:
- Requires complex processors
- Training duration of a network is somewhat unknown
- We rely on the error value too heavily
- Black-box nature
15. What is the default method for splitting in decision trees?
The default method is the Gini Index, which is the measure of impurity of a particular node. Essentially, it calculates the probability that a randomly chosen element would be classified incorrectly. When all elements at a node belong to a single class, we call the node "pure".
You could also use entropy (information gain), but the Gini Index is often preferred because it is less computationally intensive and doesn't involve logarithm functions.
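A minimal sketch of the Gini impurity calculation (the helper name `gini_index` is hypothetical):

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node (single class) scores 0; a 50/50 binary split scores 0.5
```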
Additional intermediate questions may include:
- What is a Box-Cox transformation?
- The Water Trapping problem
- Explain the advantages and disadvantages of decision trees.
- What is the exploding gradient problem when using the backpropagation technique?
- What is a confusion matrix? Why do you need it?