Lessons from fastai Machine Learning

Source: Deep Learning on Medium

A long time after [this post](https://medium.com/@crcrpar/what-i-learned-from-fast-ai-ml-till-5-510040c6d91f), I watched all the remaining lessons. However, the last couple of lessons covered some content, such as the Kaggle Rossmann competition, that was already done in the Deep Learning course, so I did not take many notes on those.

The most impressive topics are
* Implementing Random Forest from scratch using NumPy and, optionally, Cython. Here, I think we could use CuPy: a NumPy-compatible matrix library accelerated by CUDA.
* Random Forest interpretation: feature importance, partial dependence, the tree interpreter type 2 (explaining one prediction), and extrapolation.
As to interpretation, I think Christoph Molnar’s “Interpretable Machine Learning” is a must-read, because the book covers both conventional machine learning and methods for explaining neural networks’ predictions.

But because I’ve been busy, I mostly copy and paste my log here. I might elaborate on this post afterward.

What Impressed Me Most

One lesson from the second half is that what really matters is feature importance, not AUC score or accuracy. Another is that a Random Forest returns the average of neighboring points in tree space, so if an input is far from that space, the prediction will be close to the average of the whole training dataset.

Hereafter, I just copy my notes, so feel free to stop reading :)

Lesson 6

Why do I need machine learning?
The Drivetrain Approach.
Examples of how people use machine learning in business (in the slides).
Horizontal applications.

Churn: to predict who is going to leave.
Jeremy Howard’s piece on data products (“Designing Great Data Products”) would be interesting.
Define the objective, levers, data, and models.
levers = what inputs we can control
data = what data we can collect
models = how the levers influence the objective
The levers for churn prediction are…
 motivating users not to leave the service?
 changing the prices?
Clarify what we can actually do!
After this, clarify what data is available or necessary.
In practice, care more about simulation:
 build a simulation model;
 predict what happens given what the predictive model predicts.
Basically an optimization model.
The predictive model feeds its predictions into the simulation model.
The simulation model predicts the probability that the target changes their behavior in response to the action we took.

More on interpretation of predictions:
use feature importance to decide the next action!

In business, what really matters is feature importance, i.e., understanding, not the AUC score.

---

vertical applications
readmission risk:
a predictive model is helpful, of course, but feature importance would also play a role.
You can build a chart without machine learning, but with machine learning and its feature importance, the chart will be much improved and will help decision making.

There is still skepticism, stemming from unfamiliarity with this approach to data.

--- break ---
random forest interpretation.
confidence based on tree variance.
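A minimal sketch of this idea with sklearn and synthetic data (the variable names are my own): each tree in `model.estimators_` makes its own prediction, and the standard deviation across trees gives a per-row confidence.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Stack each tree's predictions: shape (n_trees, n_samples)
all_preds = np.stack([t.predict(X) for t in model.estimators_])
mean_pred = all_preds.mean(axis=0)  # same as model.predict(X) for regression
conf = all_preds.std(axis=0)        # per-row spread across trees = "confidence"
```

Rows where the trees disagree (high `conf`) are rows the forest is less sure about.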

How to calculate feature importance for a certain feature (type 1)?
From a trained random forest:
– randomly shuffle the column, recalculate the score, and measure the drop.
Jeremy looks at the relative differences;
the absolute values themselves are not important to him.
A plot of the gains is also helpful: a plateau of low-value features suggests those features are not useful.
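The shuffle-a-column idea can be sketched in a few lines; this is my own toy version (synthetic data, `permutation_importance` is my name for it), using the drop in R² as the gain.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Column 0 dominates, column 1 matters a little, column 2 is noise
y = 5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=n)

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

def permutation_importance(model, X, y):
    """Importance of a column = drop in R^2 after shuffling just that column."""
    base = model.score(X, y)
    imps = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break the column's relationship to y
        imps.append(base - model.score(Xp, y))
    return np.array(imps)

imps = permutation_importance(model, X, y)
```

Shuffling the dominant column destroys the score, so its importance comes out largest; the noise column barely moves the score.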

Partial Dependence.
There are always a bunch of interactions between different features,
so a 2D plot cannot describe them, which would be a big problem.
How to calculate?
Leave every other feature as-is, replace the feature of interest with a single value, and compute the predictions. Repeat this for each value!
partial dependence plot tells the underlying truth.
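That recipe, replace one column with a fixed value and average the predictions, looks roughly like this (my own sketch on synthetic data; `partial_dependence` is my name):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.05, size=300)  # true effect of col 0 is linear

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

def partial_dependence(model, X, col, grid):
    """For each grid value: set the whole column to it, keep everything else
    as-is, and average the model's predictions."""
    pd_vals = []
    for v in grid:
        Xc = X.copy()
        Xc[:, col] = v
        pd_vals.append(model.predict(Xc).mean())
    return np.array(pd_vals)

grid = np.linspace(0.1, 0.9, 5)
pd_vals = partial_dependence(model, X, 0, grid)
```

Plotting `grid` against `pd_vals` recovers the underlying (here linear) relationship, with the other feature's influence averaged out.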

Tree Interpreter (type 2)
feature attribution for a specific observation,
like a waterfall chart.
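The course uses the treeinterpreter library for this; the sketch below is my own re-derivation of the idea for a single sklearn tree. Walking one row down the tree, each split changes the node mean, and that change is credited to the feature that was split on. Bias (the root mean) plus all contributions equals the prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

def tree_contributions(tree, x):
    """Follow one row's path; each split's change in node mean is that
    feature's contribution to the final prediction."""
    t = tree.tree_
    node = 0
    bias = t.value[0][0][0]            # mean of y at the root
    contribs = np.zeros(len(x))
    while t.children_left[node] != -1:  # -1 marks a leaf in sklearn
        feat = t.feature[node]
        if x[feat] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        contribs[feat] += t.value[child][0][0] - t.value[node][0][0]
        node = child
    return bias, contribs

bias, contribs = tree_contributions(tree, X[0])
# bias + contribs.sum() equals tree.predict(X[:1])[0] -- a "waterfall" per row
```

For a forest, you would average the biases and contributions over all trees.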

(live coding)
The gain accumulated across splits on multiple features is the interaction of those features.
RF just returns the average of the neighboring points in tree space.
If an input is really far from the samples in the training dataset,
it just returns the average of the whole training dataset.
At the moment there is no way to handle this within RF, but there are time series methods and neural nets.
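A quick demonstration of that extrapolation failure (my own synthetic example): train on x in [0, 1] where y = 2x, then ask for x = 10. The forest can only return averages of training targets it has seen, so the prediction stays near the edge of the training range instead of anywhere near the true value of 20.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(500, 1))
y = 2 * x[:, 0]                      # the true function keeps growing past x=1

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(x, y)
pred = model.predict([[10.0]])[0]    # far outside the training range
# pred is capped by the training targets (~2), nowhere near the true 20
```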

Lesson 7

Random forests and neural nets are the 2 key methods.
A lot of progress has been made in decision-tree-based methods like random forests and GBM.
RF is harder to screw up than GBM.

With around 22 observations, the t-distribution becomes close to the normal distribution.
standard error = std / sqrt(n)
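The formula can be checked empirically (my own toy check): draw many size-n samples from a population, and the standard deviation of their means should match std / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 22
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Empirical: the spread of the means of many size-n samples
sample_means = [rng.choice(population, n).mean() for _ in range(5_000)]
empirical_se = np.std(sample_means)

# Theoretical: standard error = std / sqrt(n)
theoretical_se = population.std() / np.sqrt(n)
```

The two numbers agree to a couple of decimal places.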

Oversampling until the number of instances in each class equals that of the most common class is the right thing to do.
Or, use stratified sampling to create mini-batches.
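A small sketch of the oversampling step (my own helper name, toy labels): resample each class with replacement up to the majority class's count.

```python
import numpy as np

def oversample_indices(y, rng):
    """Resample each class (with replacement) up to the most common class's count."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = [rng.choice(np.where(y == c)[0], target, replace=True) for c in classes]
    return np.concatenate(idx)

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)   # imbalanced: 90 vs 10
idx = oversample_indices(y, rng)
balanced = y[idx]                    # now 90 of each class
```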

---

The bulldozers Kaggle competition.
Sample rows with replacement (bootstrapping).

A single decision tree doesn’t have randomness.
The randomness in Random Forest comes from creating a bunch of decision trees, i.e., choosing row indexes for each tree.

How to find the variable and value to split on in a decision tree?
Score a candidate split as lhs.std() * lhs_count + rhs.std() * rhs_count.
An O(n) pass over all split points keeps running sums and uses std = sqrt(mean(x**2) - mean(x)**2).
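Putting those two notes together, here is my own sketch of the O(n) split search: sort once, then sweep left to right keeping running sums of y and y², which lets you compute both sides' standard deviations at every split point without rescanning.

```python
import numpy as np

def find_best_split(x, y):
    """Score each split as lhs_std*lhs_cnt + rhs_std*rhs_cnt in one sorted pass,
    using running sums and std = sqrt(mean(y**2) - mean(y)**2)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    tot, tot2 = y.sum(), (y ** 2).sum()
    s, s2 = 0.0, 0.0
    best_score, best_thr = float('inf'), None
    for i in range(n - 1):
        s += y[i]
        s2 += y[i] ** 2
        if x[i] == x[i + 1]:
            continue                  # cannot split between equal x values
        cl, cr = i + 1, n - i - 1
        std_l = np.sqrt(max(s2 / cl - (s / cl) ** 2, 0.0))
        std_r = np.sqrt(max((tot2 - s2) / cr - ((tot - s) / cr) ** 2, 0.0))
        score = std_l * cl + std_r * cr
        if score < best_score:
            best_score, best_thr = score, x[i]
    return best_thr, best_score

# Two well-separated clusters: the best split should land between them
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.1, 0.0, 5.0, 5.1, 5.0])
thr, score = find_best_split(x, y)
```

The `max(..., 0.0)` guards against tiny negative variances from floating-point error.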

class A:
    def foo(self, ...): ...

def foo(self, ...): ...
A.foo = foo  # patch the method from outside the class
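A runnable version of that monkey-patching pattern (names are my own): any plain function whose first parameter plays the role of `self` can be attached to a class after the fact.

```python
class A:
    def __init__(self, v):
        self.v = v

# A plain function; its first argument plays the role of `self`
def double(self):
    return self.v * 2

A.double = double        # monkey-patch: attach it as a method
print(A(3).double())     # → 6
```

This is how the course incrementally builds up a class across notebook cells.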

Start from an assumption, and assume I’m wrong while coding.
The ternary operator is helpful.

sklearn’s Random Forest is written in Cython.
The first time, it should be slower.

Working with NumPy (Cython docs)

RF is a nearest-neighbor method.

Lesson 8

pickle works for nearly every Python object, but not optimally.
Pickle files are only usable from Python.

In random forests, normalization of the independent variables doesn’t matter;
only the order of the values does. Random Forest is immune to scale and distribution problems.
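This is easy to check (my own toy demonstration): applying a monotonic transform such as `exp` to a feature changes the split thresholds but not the orderings, so a tree fit on the transformed feature makes the same predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 1))
y = np.sin(5 * x[:, 0]) + rng.normal(scale=0.05, size=200)

t1 = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x, y)
# exp() is monotonic, so it preserves the ordering of the values
t2 = DecisionTreeRegressor(max_depth=4, random_state=0).fit(np.exp(x), y)

p1 = t1.predict(x)
p2 = t2.predict(np.exp(x))
# The thresholds differ, but the partitions -- and the predictions -- match
```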