Model Selection — Yacht Hydrodynamics Data Set (Statistical Method)

Source: Deep Learning on Medium

Yacht Hydrodynamics

The data set I am using comes from the UCI Machine Learning Repository. It is well known in machine learning studies and has been available for a long time, so you can easily find many research papers that use it. I am going to introduce my approximation model and how I evaluated different models. I only use traditional statistical methods in this article, so no fancy machine learning or deep learning terminology. I hope you enjoy it. Thanks also to my teammates: Rabin Duran Pons, Cynthia Xing and Wade Liang. (Github Repo for this project)

Before modeling, let’s check the data set:

Data Set Information — 6 explanatory variables and 1 response variable
Detailed Information of Dataset

So we need to use the first six variables to predict V7, the residuary resistance per unit weight of displacement.

Import and Forward Selection

Import the libraries into RStudio and use forward selection to rank the influence of each variable. You can easily see that V6 ranks first, followed by V2.
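The ranking step above can be sketched in a few lines. This is a minimal pure-Python illustration, not the R code used for the article: the helper names (`solve`, `ols_rss`, `forward_select`) and the three toy predictors are made up for the example, and the loop simply ranks every variable by how much it reduces the residual sum of squares, without the stopping rule an R routine would normally apply.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def forward_select(predictors, y):
    """Greedily add the predictor that lowers the RSS the most at each step."""
    chosen, remaining = [], list(predictors)
    while remaining:
        best = min(remaining,
                   key=lambda name: ols_rss([predictors[n] for n in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy data (NOT the yacht data): y is built from x_a and x_b only.
predictors = {
    "x_a": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "x_b": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
    "x_c": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0],
}
y = [10 * a + 0.5 * b for a, b in zip(predictors["x_a"], predictors["x_b"])]

order = forward_select(predictors, y)
print(order)
```

On the toy data the strongest driver is picked first and the unrelated column last, which is the kind of ranking the forward-selection output gives for V6 and V2.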

Next, let us plot V6 against V7. The plot shows a positive correlation between V6 and V7.

Before we proceed to the regression, let me state my hypothesis: in short, my null hypothesis is that all coefficients in the regression equation equal 0, which means V1 to V6 have no effect on the response variable, V7.

Null Hypothesis — Beta 0 to 7 all equal to 0

Let's test my first simple model (model #1), V7 ~ V6 + V2:

Model #1: V7 = b0 + b1*V6 + b2*V2

From the summary output, the Pr(>|t|) for V2 is about 0.395, which is larger than 0.025 or 0.05, so V2 is not significant and we cannot reject the null hypothesis for its coefficient. V6's coefficient, however, is highly significant, which means we should keep V6 but remove V2 from the prediction model. Besides, the normal Q-Q plot shows that the residuals do not follow the Q-Q line closely, and the residual plot shows a pattern. That pattern indicates that V7 has a nonlinear, higher-order relationship with the predictors, which gives me a hint for my next model: log V7.
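To see why a patterned residual plot points toward a transformed response, here is a small sketch on synthetic data (an exponential curve I made up, not the yacht measurements). The function names `fit_line` and `residuals` are hypothetical helpers for this example only.

```python
import math


def fit_line(x, y):
    """Closed-form least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Observed minus fitted values for the straight-line fit."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Synthetic, strongly curved response (assumption: exponential growth).
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [math.exp(2.0 * xi) for xi in x]

raw_res = residuals(x, y)                             # straight line on y
log_res = residuals(x, [math.log(yi) for yi in y])    # straight line on log(y)

print("raw residual signs:", "".join("+" if r > 0 else "-" for r in raw_res))
print("largest |residual| on the log scale:", max(abs(r) for r in log_res))
```

The raw fit leaves positive residuals at both ends and negative ones in the middle, a systematic pattern, while the same line fitted to the log of the response leaves essentially nothing behind. Real data will be noisier, but the diagnostic logic is the same.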

The result of my 2nd model, derived from model #1:

Model #2: log(V7) = b0 + b1*V6

Wow! We are on the right track! The residual plot is not perfect, but it looks much better than the one for model #1. All variables are significant in the R output, and the adjusted R² improved from 0.65 to 0.76. However, the residual plot still shows a wave, and I have a feeling that a higher-order polynomial may cancel it out. So let me try model #3.

The result of my 3rd model, derived from model #2:

Model #3: log(V7) = b0 + b1*V6 + b2*(V6)²

Hmm... the residual plot still has a wave pattern, but judging by adjusted R², this model is actually better than model #2. Adding more terms never lowers the plain R², so checking adjusted R² is the better choice here. Also, from the summary output, the coefficient on V6² is significant, so let me try an even higher polynomial order.
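For reference, the penalty that adjusted R² applies can be written as a one-line function. This is the standard textbook formula; the function name `adjusted_r2` and the numbers in the example are illustrative, not values from the yacht fits.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Same plain R^2, one extra predictor: the adjusted value goes down.
one_term = adjusted_r2(0.76, n=100, p=1)
two_terms = adjusted_r2(0.76, n=100, p=2)
print(one_term, two_terms)
```

An extra term therefore only raises adjusted R² if it improves the plain R² by more than the penalty costs.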

The result of my 4th model, derived from model #3:

Model #4: log(V7) = b0 + b1*V6 + b2*(V6)² + b3*(V6)³ + b4*(V6)⁴

Cool! This is the best model so far. We successfully cancelled the wave tendency in the residual plot, and the coefficients b0 to b4 are all highly significant. Adjusted R² is about 0.9047, which outperforms model #3.
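The progression from a straight line to a quartic on the log scale can be sketched as follows. This is an illustration under stated assumptions: the predictor is a made-up grid on [-1, 1] and the response is generated from a quartic I chose, so the numbers say nothing about the yacht data. `solve` and `poly_fit_rss` are hypothetical helper names.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def poly_fit_rss(x, y, degree):
    """Least-squares polynomial fit of the given degree; returns the RSS."""
    rows = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


# Synthetic data: log(y) is an exact quartic in x (assumption for the demo).
x = [i / 4 - 1 for i in range(9)]
y = [math.exp(0.5 + 1.0 * t - 0.8 * t**2 + 0.3 * t**3 + 0.6 * t**4) for t in x]
log_y = [math.log(v) for v in y]

rss = {d: poly_fit_rss(x, log_y, d) for d in (1, 2, 4)}
for d, value in rss.items():
    print(f"degree {d}: RSS = {value:.3e}")
```

Each added power can only lower the residual sum of squares, which is why the comparison between models should lean on a penalized measure rather than on the raw fit.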


Model Selection Workflow in Conclusion

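The first one is model #1, followed by #2, #3 and #4. We can see that AIC decreased a lot, from 2224 originally to 30. Keep in mind that model #1 is fitted to V7 while the others are fitted to log(V7), so only the AIC values of models #2 to #4 are strictly comparable with each other. One interesting thing: from model #3 to model #4, some people may say the improvement is not significant, but AIC makes it easy to see that the fit improved a lot!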
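As a rough guide to what the AIC numbers measure, here is a sketch of AIC for a least-squares fit with Gaussian errors, computed from the residual sum of squares. The function name `gaussian_aic` and the two example fits are hypothetical; the parameter count adds one for the error variance, which is the usual convention for this likelihood.

```python
import math


def gaussian_aic(rss, n, n_coef):
    """AIC of a linear model with Gaussian errors.

    rss    : residual sum of squares
    n      : number of observations
    n_coef : number of regression coefficients, intercept included
    """
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + 2 * (n_coef + 1)  # +1 for the error variance


# Two made-up fits to the SAME response: the larger model wins only
# because its RSS drops by more than the extra parameters cost.
aic_small = gaussian_aic(rss=50.0, n=100, n_coef=2)
aic_large = gaussian_aic(rss=20.0, n=100, n_coef=5)
print(aic_small, aic_large)
```

Lower is better, and the comparison is only meaningful between models fitted to the same response on the same observations.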

Here is something I learned in this model exploration:

  1. Adjusted R² does not differ much between really good models, so consider using AIC or Mallows's Cp, which separate them more clearly.
  2. At the beginning stage, forward or backward selection helps to figure out the most significant variables, and from there we can test some models.
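For completeness, Mallows's Cp mentioned in the first point can be computed from quantities a regression summary already reports. This is a minimal sketch of the textbook formula; `mallows_cp` and the numbers are illustrative assumptions, not results from this project.

```python
def mallows_cp(rss_sub, sigma2_full, n, n_coef):
    """Mallows's Cp for a candidate sub-model.

    rss_sub     : residual sum of squares of the candidate model
    sigma2_full : error-variance estimate from the full model
    n           : number of observations
    n_coef      : coefficients in the candidate model, intercept included
    """
    return rss_sub / sigma2_full - n + 2 * n_coef


# Made-up numbers: a candidate with 3 coefficients whose RSS is what an
# unbiased model would be expected to leave, so Cp lands at n_coef.
cp = mallows_cp(rss_sub=34.0, sigma2_full=2.0, n=20, n_coef=3)
print(cp)
```

A candidate whose Cp is close to its number of coefficients shows little sign of bias, and among such candidates the smaller Cp is preferred.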

Here is my GitHub repository:

If you find any problems in my posts or something confuses you, please leave me a message in the responses or create an issue ticket on GitHub.

Thanks again for reading!