A shallow foundation is a substructure designed to transmit loads from the superstructure to the soil near the ground surface. A foundation is classified as shallow when its width is greater than its embedment depth. Foundations are critical structures that must be designed with careful consideration of safety, strength, durability, and settlement. During construction, it is important to monitor settlement, as it directly affects the safety of the project. Foundation settlement is influenced by various factors, including geological conditions, load distribution, the magnitude of the load, and the bearing capacity of the soil. Several methods exist for studying ground-layer settlement; they can be broadly categorized into the layer-wise summation method, the finite element method, analytical calculation methods, empirical inference, and combined prediction methods.
With advancements in machine learning, many researchers have begun applying such techniques to settlement prediction, including support vector machines (SVM), artificial neural networks (ANN), adaptive neuro-fuzzy inference systems (ANFIS), genetic programming (GP), simulated annealing (SA), and particle swarm optimization combined with SVM (PSO-SVM).
The study conducted by Thi Thanh Huong Ngo and Van Quan Tran, from the Faculty of Civil Engineering at the University of Transport Technology in Thanh Xuan, Hanoi, Vietnam, focused on predicting foundation settlement using several models: Random Forest (RF), Support Vector Machine (SVM), Gradient Boosting (GB), and K-Nearest Neighbors (KNN). These models were optimized using Particle Swarm Optimization (PSO) to find the best hyperparameter settings.
The dataset consisted of 189 samples with six input features: the footing embedment ratio (D/B), footing width (B), footing geometry (L/B), depth of the water table (d), net applied pressure (q), and the average Standard Penetration Test (SPT) blow count. The output variable was foundation settlement (S). The dataset was divided into 70% for training and 30% for testing.
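As a concrete illustration of this setup, the following minimal Python sketch loads such a dataset and performs the 70/30 split; the file name and column names are hypothetical stand-ins, since the paper's dataset is not reproduced here.

```python
# Minimal sketch of the data setup described above. File and column names
# are illustrative assumptions, not the paper's actual identifiers.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("settlement_data.csv")  # hypothetical file with 189 samples

features = ["D_over_B", "B", "L_over_B", "d", "q", "SPT_avg"]  # assumed names
X = df[features]
y = df["S"]  # foundation settlement (output variable)

# 70% training / 30% testing, as in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```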
Before conducting the analysis, a review of outliers was performed using box plots. These visualizations indicate that the settlement values vary widely in certain regions, suggesting a skewed data distribution with numerous outliers. If not addressed appropriately, such outliers may degrade the performance of machine learning models; when identified and managed effectively, however, they can contribute positively to model robustness. This study therefore takes a deliberate approach to retaining the outlier data, relying on algorithms such as Random Forest (RF) and Gradient Boosting (GB), whose tree-based methodologies make them resilient to outliers.
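Continuing the sketch above, the box-plot review might look like the following; this is an illustrative reconstruction rather than the authors' plotting code.

```python
# One box plot per variable to reveal skew and outliers.
import matplotlib.pyplot as plt

cols = features + ["S"]
fig, axes = plt.subplots(1, len(cols), figsize=(16, 4))
for ax, col in zip(axes, cols):
    ax.boxplot(df[col].dropna())
    ax.set_title(col)
plt.tight_layout()
plt.show()
```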
The study employed k-fold cross-validation, a robust statistical method for comparing models and identifying the best one for a given problem. In this approach, the training dataset is partitioned into k subsets, or folds. A model is fitted on data from k-1 folds, while the remaining fold serves as the validation set. The process is repeated k times, with a different fold held out on each iteration, which allows model variants with different parameter sets to be compared fairly. For this study, 10-fold cross-validation was implemented to ensure thorough and reliable validation of model performance.
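A minimal sketch of this 10-fold scheme with scikit-learn, using a default Gradient Boosting regressor as a stand-in model:

```python
# 10-fold cross-validation on the training set, scored by R^2.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

model = GradientBoostingRegressor(random_state=42)  # default settings
scores = cross_val_score(model, X_train, y_train, cv=10, scoring="r2")
print(f"mean R^2 over 10 folds: {scores.mean():.3f} +/- {scores.std():.3f}")
```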
Shapley Additive Explanations (SHAP) is a machine learning technique used for model interpretation, particularly to assess how each input feature contributes to the individual predictions of complex models. It quantifies the contribution of each independent variable to a prediction, providing clear visual explanations for specific data points. Unlike many traditional interpretation approaches, SHAP offers easily interpretable visualizations, which enhance understanding of a model's reliability.
SHAP is particularly useful for analyzing how the models arrive at their predictions and how the independent variables influence the settlement of shallow foundations. The SHAP analysis was conducted on the Random Forest (RF) and Gradient Boosting (GB) models. The results indicate that, for both models, the average Standard Penetration Test (SPT) blow count has the most significant impact; as SPT values increase, their effect on predicted settlement generally decreases. The width of the foundation (B) also shows a notable influence in both models, while factors such as the depth of the water table (d) have minimal impact.
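As a sketch of how such an analysis can be run with the shap library (the model here is a default-settings RF regressor, not the authors' tuned model):

```python
# SHAP analysis for a fitted tree-based model via TreeExplainer.
import shap
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42).fit(X_train, y_train)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Summary plot: features ranked by mean |SHAP value|; in the study,
# the average SPT blow count ranked highest.
shap.summary_plot(shap_values, X_test)
```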
The authors employed Particle Swarm Optimization (PSO) to fine-tune the hyperparameters of the machine learning models, enhancing their performance. For the Gradient Boosting (GB) algorithm, several hyperparameters were optimized (a minimal PSO sketch follows this list):
1. Number of Trees: This parameter, which ranges from 1 to 40, dictates the model’s boosting stages. Each tree in the ensemble contributes to improving the overall prediction quality.
2. Learning Rate: This hyperparameter, set between 0.3 and 0.5, controls the contribution of each individual tree to the final prediction. A lower learning rate means that the model adjusts its predictions more conservatively, requiring more trees to achieve optimal performance.
3. Maximum Features: This parameter, ranging from 1 to 4, determines the number of features considered at each split in the decision trees. It effectively controls how the model examines the data during the fitting process.
4. Maximum Depth: This hyperparameter, defined between 1 and 5, restricts the maximum depth of each
tree, preventing overfitting by limiting how complex the model can become.
5. Minimum Samples Required to Split a Node: This value, which ranges from 0.02 to 0.09, indicates the minimum fraction of training samples that must be present in a node before it can be split further, ensuring that splits occur only when the node contains enough observations.
6. Minimum Samples Required to Form a Leaf Node: Similar to the previous parameter, this one also varies between 0.02 and 0.09, establishing the minimum fraction of samples necessary to create a leaf node in the tree structure.
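As noted above, here is a hand-rolled PSO sketch over these six ranges, scoring each candidate by 10-fold cross-validated R². The swarm size, iteration count, and inertia/acceleration coefficients are illustrative assumptions, not the authors' settings.

```python
# Minimal PSO loop for tuning GB over the ranges listed above.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Bounds: n_estimators, learning_rate, max_features, max_depth,
#         min_samples_split, min_samples_leaf
lb = np.array([1, 0.3, 1, 1, 0.02, 0.02])
ub = np.array([40, 0.5, 4, 5, 0.09, 0.09])

def fitness(x):
    """Negative mean 10-fold CV R^2 for one candidate position."""
    gb = GradientBoostingRegressor(
        n_estimators=int(round(x[0])), learning_rate=x[1],
        max_features=int(round(x[2])), max_depth=int(round(x[3])),
        min_samples_split=x[4], min_samples_leaf=x[5], random_state=42,
    )
    return -cross_val_score(gb, X_train, y_train, cv=10, scoring="r2").mean()

rng = np.random.default_rng(0)
n_particles, n_iters, dim = 20, 30, 6
w, c1, c2 = 0.7, 1.5, 1.5                      # assumed PSO coefficients
pos = rng.uniform(lb, ub, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmin()]

for _ in range(n_iters):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lb, ub)           # keep particles in bounds
    vals = np.array([fitness(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()]

print("best GB hyperparameters found:", gbest)
```

Integer-valued hyperparameters are handled here by rounding the continuous particle positions, a common simplification in PSO-based tuning.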
For the Random Forest (RF)
algorithm, the hyperparameters that were optimized included:
1. Number of Trees: Like GB, this parameter varies from 1 to 40. Generally, more trees improve the model’s performance by averaging the predictions of the individual trees.
2. Maximum Features: Also
ranging from 1 to 4, this parameter governs how many features are considered at each split, helping to improve model diversity and reduce overfitting.
3. Maximum Depth: This hyperparameter, limited to between 1 and 5, defines the maximum depth allowed for each tree in the forest, controlling model complexity.
4. Minimum Samples for Splits and Leaves: Both parameters range from 0.02 to 0.09 (again expressed as fractions of the training set), ensuring that there are enough samples to make valid decisions during splitting and to form leaf nodes.
For the Support Vector Machine
(SVM) algorithm, the following hyperparameters were fine-tuned:
1. Regularization Parameter (C): This parameter, which ranges from 1 to 250, controls the trade-off between keeping the model simple and fitting the training data closely. A high value of C prioritizes minimizing training errors, potentially at the cost of generalization.
2. Kernel Coefficient (γ): Set between 0.1 and 10, this parameter determines the reach of individual data points. A small γ gives each point a broad, smooth influence, while a large γ produces highly localized, complex fits.
3. Kernel Type: The type of kernel used can significantly affect model performance. Options include polynomial, radial basis function (RBF), sigmoid, and linear kernels. Each has its own characteristics and suitability depending on the underlying data distribution.
Lastly, for the K-Nearest Neighbors (KNN) algorithm, the hyperparameters optimized included the following (a sketch consolidating the RF, SVM, and KNN search spaces follows this list):
1. Number of Neighbors (K): This parameter varies from 1 to 10 and sets the number of reference points used to predict the value for a new data point. A smaller K is sensitive to noise and can lead to overfitting, while a larger K smooths the prediction.
2. Leaf Size: Ranging from 1 to 5, this parameter affects the efficiency of the search in the tree structures used by the KNN algorithm, influencing the speed at which the nearest neighbors can be identified.
3. Power Parameter (p): This parameter ranges from 1 to 5 and defines the distance metric used, with different values corresponding to different distance calculations: Manhattan (p = 1), Euclidean (p = 2), or general Minkowski distances.
4. Algorithm Type: The options for the neighbor search include auto, ball_tree, kd_tree, and brute. Each has advantages depending on the dataset size and dimensionality.
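For reference, the sketch below maps the RF, SVM, and KNN search spaces onto scikit-learn estimators. Because settlement is a continuous target, the regression variants (RandomForestRegressor, SVR, KNeighborsRegressor) are assumed; the categorical options (kernel, algorithm) would need an integer encoding before PSO could search over them.

```python
# Assumed scikit-learn counterparts of the three remaining search spaces.
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

search_spaces = {
    RandomForestRegressor: {
        "n_estimators": (1, 40), "max_features": (1, 4), "max_depth": (1, 5),
        "min_samples_split": (0.02, 0.09), "min_samples_leaf": (0.02, 0.09),
    },
    SVR: {
        "C": (1, 250), "gamma": (0.1, 10),
        "kernel": ["poly", "rbf", "sigmoid", "linear"],  # categorical choice
    },
    KNeighborsRegressor: {
        "n_neighbors": (1, 10), "leaf_size": (1, 5), "p": (1, 5),
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],  # categorical
    },
}
```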
The optimized hyperparameters obtained are as follows (a sketch instantiating the models with these values appears after the lists):
For the Gradient Boosting (GB) model:
– Number of trees: 11
– Learning rate: 0.45
– Maximum features: 2
– Maximum depth: 2
– Minimum samples required to split a node: 0.447
– Minimum samples required to form a leaf: 0.0351
For the Random Forest (RF) model:
– Number of trees: 7
– Maximum features: 4
– Maximum depth: 3
– Minimum samples required to split a node: 0.025
– Minimum samples required to form a leaf: 0.06
For the Support Vector Machine (SVM) model:
– Regularization parameter (C): 199
– Kernel coefficient (gamma): 6.71
– Kernel type: radial basis function (RBF)
For the K-Nearest Neighbors (KNN) model:
– Number of neighbors: 4
– Leaf size: 2
– Power parameter: 1
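Under the same scikit-learn naming assumptions, the reported optima translate into model instances roughly as follows:

```python
# Instantiating the four models with the reported optimal values.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

gb_best = GradientBoostingRegressor(
    n_estimators=11, learning_rate=0.45, max_features=2, max_depth=2,
    # 0.447 is carried over verbatim from the paper, although it lies
    # outside the stated 0.02 to 0.09 search range.
    min_samples_split=0.447, min_samples_leaf=0.0351, random_state=42,
)
rf_best = RandomForestRegressor(
    n_estimators=7, max_features=4, max_depth=3,
    min_samples_split=0.025, min_samples_leaf=0.06, random_state=42,
)
svm_best = SVR(C=199, gamma=6.71, kernel="rbf")
knn_best = KNeighborsRegressor(n_neighbors=4, leaf_size=2, p=1)
```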
Based on the optimal hyperparameters obtained, the models were evaluated and compared primarily using the R-squared metric. The results indicated that the Gradient Boosting with Particle Swarm Optimization (GB-PSO) model demonstrated the highest accuracy among the optimized models, achieving an R-squared value of 0.805. In further comparisons among the Gradient Boosting (GB), Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) models, the GB model outperformed all others, attaining an R-squared value of 0.948.
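A sketch of this comparison for the tuned GB model from the previous sketches:

```python
# R^2 on the training and testing splits for the tuned GB model.
from sklearn.metrics import r2_score

gb_best.fit(X_train, y_train)
print("train R^2:", r2_score(y_train, gb_best.predict(X_train)))
print("test  R^2:", r2_score(y_test, gb_best.predict(X_test)))
```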
In summary, the GB model is the most effective predictive model for shallow foundation settlement, exhibiting high accuracy and low error metrics on both the training and testing datasets. The RF model ranks as the second-best option, albeit with slightly higher errors. The SVM-PSO and KNN-PSO models were less effective, with KNN-PSO demonstrating the lowest accuracy. In light of these findings, the GB model is recommended as the most reliable choice for accurate settlement prediction.
Reference
Huong Ngo, T. T., & Tran, V. Q. (2024). Predicting and evaluating settlement of shallow foundation using machine learning approach. Science Progress, 107(4), 00368504241302972.