Before talking about generalization in machine learning, it’s important to first understand what supervised learning is. In supervised learning, a model is given a set of labeled training data and learns to make predictions from it. Generally, the more training data the model sees, the better its predictions become. Because the outcomes in the training data are already known, the model’s predictions can be compared against them, and the model’s parameters are adjusted until the two line up. The aim of training is to develop the model’s ability to generalize successfully.
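To make this workflow concrete, here is a minimal sketch of supervised learning using scikit-learn; the iris dataset and logistic regression are arbitrary choices for illustration, not a prescribed setup:

```python
# A minimal supervised-learning sketch, assuming scikit-learn is available.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)    # features and their known labels

model = LogisticRegression(max_iter=1000)
model.fit(X, y)                      # learn from the labeled examples

# The model's predictions can be compared against the known labels;
# score() reports the fraction of correct predictions.
print(model.score(X, y))
```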
What is generalization?
The term ‘generalization’ refers to the model’s capability to adapt and react properly to previously unseen data drawn from the same distribution as the data used to build the model. In other words, generalization examines how well a model can digest new data and make correct predictions after being trained on a training set.
How well a model is able to generalize is the key to its success. If a model fits the training data too closely, capturing its noise and quirks rather than the underlying pattern, it will be incapable of generalizing and will make erroneous predictions when given new data. This is known as overfitting, and it makes the model ineffective even though it predicts correctly on the training data set. The inverse problem, underfitting, happens when the model is too simple or insufficiently trained to capture the underlying pattern at all. An underfit model fails to make accurate predictions even on the training data, which makes it just as useless as an overfit one.
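One way to see both failure modes is to fit the same noisy data with models of varying flexibility. Here is a sketch using polynomial regression on a synthetic dataset (the data and the chosen degrees are illustrative assumptions): a low degree underfits, a very high degree overfits, and something in between generalizes best:

```python
# Sketch: underfitting vs. overfitting via polynomial degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(60, 1))                       # synthetic inputs
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # A large gap between train and test score signals overfitting;
    # two low scores signal underfitting.
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))
```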
The ideal solution
You would ideally want to choose a model that sits at the sweet spot between overfitting and underfitting. To find it, you can track the performance of a machine learning algorithm over time as it works through a set of training data, plotting both its skill on the training data and its skill on a test dataset that you’ve held back from the training process. As the algorithm learns, the error on the training data decreases, and so does the error on the test dataset. If you train for too long, however, the error on the training dataset keeps shrinking as the model overfits, while the error on the test set starts to rise again because the model is losing its ability to generalize. The sweet spot is the point just before the error on the test dataset begins to rise, where the model shows good skill on both the training dataset and the unseen test dataset.
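Here is one way you might trace those two error curves in practice. This sketch uses gradient boosting, where each boosting stage stands in for additional training time; the dataset, model, and parameters are illustrative assumptions, and the exact shape of the curves will vary with the data:

```python
# Sketch: tracking train vs. test error as a model trains longer.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields predictions after each boosting stage, so both
# error curves can be traced over "training time".
train_errors = [mean_squared_error(y_train, p) for p in model.staged_predict(X_train)]
test_errors = [mean_squared_error(y_test, p) for p in model.staged_predict(X_test)]

# The sweet spot is just before the test error starts to climb again.
best_stage = int(np.argmin(test_errors))
print(f"lowest test error at stage {best_stage + 1}")
```

The same idea underlies early stopping in iterative training more generally: halt once the held-out error stops improving.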
To limit overfitting in a machine learning algorithm, two additional techniques that you can use are (a combined sketch follows the list):
- Using a resampling method to estimate the accuracy of the model
- Holding back a validation dataset
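Here is a brief sketch of both techniques together, using scikit-learn’s k-fold cross-validation as the resampling method and a simple hold-out split for the validation set (the dataset and model are illustrative assumptions):

```python
# Sketch: estimating model accuracy on unseen data in two ways.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 1) Resampling: 5-fold cross-validation averages accuracy across folds,
#    each fold scored on data the model did not train on.
scores = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())

# 2) Hold-out: keep a validation set completely out of training and
#    score the fitted model on it.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```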
So, as you train machine learning models, keep an eye on generalization by estimating your model’s accuracy on data it has never seen.