4.4. Codeless advice#

4.4.1. Best overfitting advice#

This is the best advice I have read on combating overfitting:

“To achieve the perfect fit, you must first overfit”.

Here are the reasons why:

First, it makes sense - you can’t fight overfitting without a model that overfits.

Second, it is a sign of power - if a model is overfitting or perfectly memorizing the training data, it is a sign that the model has enough optimization power to learn the patterns in the training data.

Solving ML problems is all about the tension between optimization (how well the model learns from training data) and generalization (how well the model performs on unseen data).

Once you can build a model that is able to overfit, you should shift your focus to generalization, because too much optimization hurts it. Try less complex model architectures, apply regularization, and add random dropout (the DART booster in XGBoost or Dropout layers in TensorFlow) to dial back optimization and increase generalization.

You won’t be able to do any of this unless you have a model that overfits.
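Here is a minimal sketch of those two knobs; the layer sizes, rates, and tree counts are illustrative placeholders, not tuned recommendations:

```python
import tensorflow as tf
from xgboost import XGBClassifier

# Neural net: Dropout randomly silences units so the network cannot
# rely on any single pathway to memorize the training set.
net = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),  # assumed: a 20-feature binary task
    tf.keras.layers.Dense(
        64,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # weight penalty
    ),
    tf.keras.layers.Dropout(0.3),  # drop 30% of units each training step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
net.compile(optimizer="adam", loss="binary_crossentropy")

# Gradient boosting: the DART booster applies the same idea to whole trees.
gbm = XGBClassifier(booster="dart", rate_drop=0.1, n_estimators=300)
```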

4.4.2. Why beginners won’t do LR and keep choosing XGBoost#

Last year, I saw that a tabular competition on Kaggle was won by an ensemble of Quadratic Discriminant Analysis models. What is QDA, you ask? I had no idea either.

It was an eye-opening experience for me as a beginner, because I had thought that, having learned XGBoost, I could simply ignore all the older models.

I was distracted by the hot tools. It turns out it isn’t about the tool but about how quickly and efficiently you can solve a problem.

Later, I found that on that particular competition’s data, QDA was orders of magnitude faster than any tree-based model and could easily beat them in terms of performance.
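To see why it is so fast, here is a hedged sketch on synthetic data (not the actual competition data): QDA just estimates one mean and covariance matrix per class, so there are no trees to grow.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a tabular dataset.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Fitting QDA is nearly instantaneous compared to growing tree ensembles.
qda = QuadraticDiscriminantAnalysis()
print(cross_val_score(qda, X, y, cv=5).mean())
```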

So, the moral here is: don’t approach problems with a tools-first mindset. Instead, find the simplest approach that solves the problem well. Don’t try to look “cool” by using whatever is popular at the time.

4.4.3. Should you always cross-validate?#

Is it a requirement to use cross-validation every time? The answer is a tentative “Yes”.

When your dataset is sufficiently large, every random train/test split should resemble the original data well. However, each model comes with its own inherent bias, and there will be samples it favors over others.

That’s why it is always recommended to use CV techniques. Even when the data is large, you should at least go for 2-3 fold CV.

As the dataset gets smaller, you can increase the number of folds. When it is dangerously small, say below 100 rows, you can go for extreme CV techniques such as LeaveOneOut or LeavePOut.
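A minimal sketch of that sliding scale; the model and dataset sizes are placeholders, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

model = LogisticRegression(max_iter=1000)

# Large data: a few shuffled folds are enough.
X_big, y_big = make_classification(n_samples=50_000, random_state=0)
print(cross_val_score(model, X_big, y_big,
                      cv=KFold(n_splits=3, shuffle=True, random_state=0)).mean())

# Dangerously small data: leave-one-out validates on every single row once.
X_tiny, y_tiny = make_classification(n_samples=80, random_state=0)
print(cross_val_score(model, X_tiny, y_tiny, cv=LeaveOneOut()).mean())
```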

I have talked about CV techniques in detail in one of my recent articles. Give it a read!

https://bit.ly/3z5e02c

4.4.4. Why does ensembling work better than single models?#

Why does ensembling work better than single models?

**Reason 1**

Members of the ensemble learn different mapping functions from inputs to outputs. A good ensemble contains members whose learned functions are as different as possible, exploring the information space of the data from all angles. They make different assumptions about its structure and make errors on different cases.

**Reason 2**

The predictions are always combined in some way, which allows the ensemble to exploit the differences between its members’ predictions. In other words, you don’t have to take the word of a single model; you get a collective opinion on each case, lowering the risk of an inaccurate prediction.
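Here is a hedged sketch of both reasons with Sklearn’s VotingClassifier; the three members are illustrative picks, chosen only because they learn in structurally different ways:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2_000, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("linear", LogisticRegression(max_iter=1000)),       # global linear boundary
        ("forest", RandomForestClassifier(random_state=0)),  # axis-aligned splits
        ("knn", KNeighborsClassifier()),                     # local, distance-based
    ],
    voting="soft",  # combine by averaging predicted probabilities
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```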

**Reason 3**

There is also a beautiful probabilistic reason why an ensemble of models with different scores beats another set of models with similar scores. The proof is a bit long, but I will definitely talk about it next week.
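Until then, here is a back-of-the-envelope taste of that argument, assuming three *independent* classifiers that are each correct 70% of the time (independence rarely holds exactly in practice):

```python
from math import comb

p = 0.7  # accuracy of each individual, independent classifier

# The majority vote is correct when at least 2 of the 3 members are correct.
majority = sum(comb(3, k) * p**k * (1 - p) ** (3 - k) for k in (2, 3))
print(majority)  # 0.784 > 0.7: the vote beats any single member
```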

Separately, there is a heated debate on whether the benefits gained from ensembles outweigh their costs, but that’s a topic for another post.

4.4.5. How to get total control over randomness in Python#

How do you get total control over the randomness in your scripts and notebooks? Not by using np.random.seed!

According to Robert Kern (a major NumPy contributor) and the Sklearn official user guide, you should use RNG instances for totally reproducible results.

You should replace every mention of random_state=None with an np.random.RandomState instance seeded once at the top of your script, so that all estimators and splitters across every run and thread share the same random state. The behavior of RandomState (RNG) instances is particularly important when you use CV splitters.
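A minimal sketch of the pattern: seed one RandomState at the top and pass the instance itself (not None, not a bare int) to everything downstream.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(42)  # the single source of randomness

X, y = make_classification(n_samples=1_000, random_state=rng)
model = RandomForestClassifier(random_state=rng)
cv = KFold(n_splits=5, shuffle=True, random_state=rng)

print(cross_val_score(model, X, y, cv=cv).mean())
```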

You can read more about this from a StackOverflow discussion or a pretty detailed guide on controlling randomness by Sklearn:

SO thread: https://bit.ly/3A2hW5i

Sklearn guide: https://bit.ly/3SwbLh9