Bagging In Depth:
Definition: Bagging is an ensemble technique that improves model performance. It reduces the variance of a prediction model without affecting its bias, which means it reduces overfitting and helps the model generalize well. It can be used for both regression and classification models. The name 'Bagging' is derived from 'Bootstrap aggregating.'
Importance of Ensemble Techniques: Ensemble techniques increase model performance, especially by reducing the model's variance: they shrink the spread of the model's predictions around the average prediction.
There are four main types of ensemble techniques: bagging, boosting, stacking, and cascading.
The base model (weak learner) chosen in bagging has low bias and high variance; a decision tree is a typical choice. The Random Forest algorithm is an ensemble technique of exactly this kind: multiple decision tree models are trained on samples of the data, and their outputs are combined by majority vote for classification, or by the mean or median for regression problems.
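As a quick sketch of this bag-of-trees idea, scikit-learn's RandomForestClassifier trains each tree on its own bootstrap sample and aggregates by majority vote (the dataset and parameter values below are illustrative, not from the original text):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative toy dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Each of the 30 trees is fit on a bootstrap sample of the rows;
# the predicted class is decided by majority vote across trees.
clf = RandomForestClassifier(n_estimators=30, random_state=42)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy
```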
Bootstrap sampling or bootstrapping in detail:
Bootstrap sampling means sampling the rows of a dataset with replacement; in the procedure below, columns are sampled as well.
Step 1 – Creating samples.
Randomly create m samples from the whole dataset (number of rows = n).
- Creating each sample: randomly pick 60% of the data points from the whole dataset (without replacement), then replicate 40% of the points by drawing (with replacement) from the points already sampled.
- Ex: For better understanding of this procedure, assume we have 10 data points [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. First we take 6 data points at random, say [4, 5, 7, 8, 9, 3]. Then we replicate 4 points drawn from [4, 5, 7, 8, 9, 3], say [5, 8, 3, 7]. Our final sample will be [4, 5, 7, 8, 9, 3, 5, 8, 3, 7].
- Create 30 samples like this.
- Note that, as part of bagging, when you take the random samples, make sure each sample has a different set of columns.
- Ex: assume we have 10 columns for the first sample we will select [3, 4, 5, 9, 1, 2] and for the second sample [7, 9, 1, 4, 5, 6, 2] and so on…
- Make sure each sample will have at least 3 features/columns/attributes.
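The sampling procedure in Step 1 can be sketched in Python as follows. The function name and parameter defaults are illustrative; the 60%/40% split and the minimum-3-columns rule mirror the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bootstrap_sample(n_rows, n_cols, row_frac=0.6, rep_frac=0.4, min_cols=3):
    """Step 1 sketch: pick 60% of row indices without replacement, then
    replicate 40% (of n_rows) by drawing with replacement from those rows."""
    base = rng.choice(n_rows, size=int(row_frac * n_rows), replace=False)
    extra = rng.choice(base, size=int(rep_frac * n_rows), replace=True)
    rows = np.concatenate([base, extra])        # final sample has n_rows indices
    # Each sample also gets its own random column subset (at least min_cols columns)
    k = int(rng.integers(min_cols, n_cols + 1))
    cols = rng.choice(n_cols, size=k, replace=False)
    return rows, cols

# Create 30 samples, as in the note above
samples = [make_bootstrap_sample(n_rows=10, n_cols=10) for _ in range(30)]
```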
Step 2 – Building high-variance models on each of the samples and finding the train MSE value
1. Build a regression tree on each of the samples.
2. Compute the predicted value of each data point (n data points) in your corpus.
3. Predict ŷ for the i-th data point as the mean of the predictions of all the models, and use these aggregated predictions to compute the train MSE.
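Steps 1 and 2 together might look like the sketch below. For brevity it uses the classical bootstrap (n draws with replacement, keeping all columns) rather than the 60/40 row scheme above; the synthetic data and variable names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, d, m = 200, 5, 30                               # rows, columns, number of samples/models
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # illustrative regression target

models = []
for _ in range(m):
    rows = rng.choice(n, size=n, replace=True)      # bootstrap sample of rows
    tree = DecisionTreeRegressor()                  # unpruned tree: low bias, high variance
    tree.fit(X[rows], y[rows])
    models.append(tree)

# Aggregate: the prediction for point i is the mean over the m trees
y_hat = np.mean([t.predict(X) for t in models], axis=0)
train_mse = float(np.mean((y - y_hat) ** 2))
print(train_mse)
```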
Step 3 – Calculating the OOB score
1. Predict the value of xi using the kth model (a model that was not trained on xi).
2. Calculate the OOB score.
The OOB (out-of-bag) score is a performance metric for a machine learning model, specifically for ensemble models such as random forests. It is calculated using the samples that were not used in training a given model, which are called out-of-bag samples. These samples provide an unbiased estimate of the model's performance on unseen data, known as the OOB score. The OOB error can also be used to tune the hyperparameters of a model: by using it as a performance metric, the hyperparameters can be adjusted to improve performance on unseen data.
It is also used to check whether the model is overfitting or underfitting.
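A minimal OOB-score computation, assuming the same manual setup as before (synthetic data, one deep regression tree per bootstrap sample). For each point, only the models whose bootstrap sample excluded that point contribute to its OOB prediction:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n, m = 300, 30
X = rng.normal(size=(n, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=n)

models, in_bag = [], []
for _ in range(m):
    rows = rng.choice(n, size=n, replace=True)
    models.append(DecisionTreeRegressor().fit(X[rows], y[rows]))
    in_bag.append(set(rows.tolist()))               # remember which rows each model saw

# OOB prediction for x_i: average only over models NOT trained on point i
oob_pred = np.full(n, np.nan)
for i in range(n):
    preds = [t.predict(X[i:i + 1])[0]
             for t, seen in zip(models, in_bag) if i not in seen]
    if preds:                                       # a point may (rarely) be in every bag
        oob_pred[i] = np.mean(preds)

mask = ~np.isnan(oob_pred)
oob_mse = float(np.mean((y[mask] - oob_pred[mask]) ** 2))
print(oob_mse)
```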
Refer to these images to better understand steps 1, 2 and 3.
Building regression trees: https://i.imgur.com/pcXfSmp.png
OOB score calculation: https://i.imgur.com/95S5Mtm.png
Bootstrapping with Decision tree as base model (without using sklearn):
Prepare the data according to Step 1.
Build a Decision Tree Regressor model on each sample.
Predict the value of xi with the kth model (a model which is not trained on the xi data point).
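Putting the three steps together without sklearn's ensemble module (the tree itself still comes from sklearn; the class name and defaults below are illustrative, not a fixed API):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class ManualBaggingRegressor:
    """Hand-rolled bagging: bootstrap the rows, fit one unpruned tree per
    sample, and average the trees' predictions at inference time."""

    def __init__(self, n_estimators=30, seed=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n = len(X)
        self.trees_ = []
        for _ in range(self.n_estimators):
            rows = self.rng.choice(n, size=n, replace=True)   # Step 1: bootstrap sample
            self.trees_.append(DecisionTreeRegressor().fit(X[rows], y[rows]))  # Step 2
        return self

    def predict(self, X):
        # Aggregate by averaging (for classification, use majority vote instead)
        return np.mean([t.predict(X) for t in self.trees_], axis=0)

# Usage sketch on illustrative toy data
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=150)
preds = ManualBaggingRegressor(n_estimators=30).fit(X, y).predict(X)
```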
Train and run time complexity of the Random Forest algorithm:
Training time: O(n·log(n)·d·k)
where n = size of dataset (number of rows)
d = number of features or columns in the dataset
k = number of base learners
Run time (prediction): O(depth of a tree · k), since a query point traverses one root-to-leaf path in each of the k trees.
Space complexity: O(space required to store a decision tree · k)
where k = number of base learners
Bagging takes advantage of ensemble learning wherein multiple weak learners outperform a single strong learner.
Bagged trees are allowed to grow deep without pruning, so each individual tree has low bias but high variance; averaging many such trees then reduces the variance, which can help improve predictive power.
There is a loss of interpretability of the model.
There can possibly be a problem of high bias if not modeled properly.
While bagging gives us more accuracy, it is computationally expensive and may not be desirable depending on the use case.
Differences between Bagging and Boosting:
- Bagging helps reduce the variance of the model; boosting helps reduce the bias of the model.
- In bagging, each base model is trained in a parallel fashion; in boosting, models are trained sequentially, and the model at each step is influenced by the previous model.
- In bagging, training subsets are selected from the entire training dataset by random row sampling with replacement; in boosting, every new subset emphasizes the elements that were misclassified by previous models.
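The contrast is visible in scikit-learn, where BaggingClassifier fits independent trees on bootstrap samples while AdaBoostClassifier fits models sequentially, reweighting misclassified points (the dataset and parameter values below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Bagging: 30 independent trees on bootstrap samples (conceptually parallel)
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=30, random_state=0).fit(X, y)

# Boosting: 30 weak learners trained sequentially, each focusing on
# the points the previous ones got wrong
boost = AdaBoostClassifier(n_estimators=30, random_state=0).fit(X, y)

print(bag.score(X, y), boost.score(X, y))
```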
Applications of Bagging:
1.Bagging is used in both classification and regression tasks.
Classification tasks like sentiment analysis, spam email detection etc.
Regression tasks like stock market forecasting, housing price estimation etc.
2.Bagging can be applied to anomaly detection tasks, where the goal is to identify rare and unusual events, such as fraud detection in financial transactions or network intrusion detection.
Future trends in Bagging:
Bagging techniques could be explored in the context of deep learning, where the application of ensembles is still an emerging area. Combining bagging with neural networks might offer improvements in both predictive performance and model interpretability. As model interpretability becomes crucial in various domains, there might be efforts to make bagging models more interpretable. This could involve techniques to explain the contributions of individual base models within the ensemble.