Imagine driving over a bridge, a silent giant of concrete and steel supporting your journey. We trust these structures implicitly, but like anything exposed to the elements and the relentless pressure of traffic, they can develop hidden problems. Cracks, spalling (where the concrete surface flakes off), exposed reinforcing bars – these are just some of the ailments that can plague our bridges, impacting their durability, reliability, and most importantly, our safety.
For decades, ensuring the health of this vital infrastructure has relied largely on manual inspection. Picture engineers and inspectors meticulously examining every inch of a bridge, often from precarious positions, taking notes and photographs. Their expertise is invaluable, but the method is subjective, risky, labor-intensive, and ultimately slow. Consider the sheer scale: by the end of 2023, China alone had over a million highway bridges spanning an astounding 95 million meters. Inspecting such a vast network with traditional methods is a monumental challenge.
Thankfully, a new era of bridge inspection is dawning, powered by the remarkable advancements in artificial intelligence (AI) and image processing. The idea is simple yet powerful: train computers to “see” and identify these subtle signs of distress in images of bridge surfaces, automating the detection process and freeing up human experts to focus on critical analysis and repair decisions.
Early attempts to automate disease detection in bridge images relied on traditional image processing techniques. Methods like Canny edge detection were used to highlight cracks, while Otsu threshold segmentation and its variants aimed to separate damaged areas based on pixel intensity. Multi-scale centerline detection and the maximum entropy method were also explored for crack identification.
These early methods had their merits – they were often simple to implement and had clear physical meanings. However, they were also easily fooled by noise in the images and primarily effective for diseases with distinct edges, like cracks. They struggled with area-based diseases like spalling or efflorescence (the white powdery deposit you sometimes see on concrete) and other more complex issues where the boundaries aren’t so clear.
The limitations of these traditional approaches paved the way for the rise of deep learning in bridge surface disease detection. Deep learning models, loosely inspired by the structure of the human brain, can learn complex patterns from vast amounts of data, offering high performance and strong flexibility. Object detection frameworks like Fast R-CNN, YOLOv3, YOLOv5, and DETR were adapted to identify bridge diseases such as cracks, spalling, efflorescence, corrosion stains, and exposed rebar. These methods can quickly locate diseases, but they typically cannot provide detailed information such as the exact area affected.
This is where semantic segmentation comes into play. Instead of just drawing a box around a problem, semantic segmentation aims to classify every single pixel in an image into predefined categories (e.g., healthy concrete, crack, spalling). This provides a much more detailed and precise understanding of the extent and location of the damage, allowing for better quantification and informed repair planning. Early semantic segmentation approaches used Fully Convolutional Networks (FCNs) with pre-trained architectures like VGG to identify delamination (separation of concrete layers) and rebar exposure. Other networks, such as SCCDNet, focused on pixel-level crack segmentation using efficient convolutional techniques, and transformer-based methods built on the Swin Transformer have even aimed to quantify crack widths from drone images.
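To make "a class for every pixel" concrete, here is a minimal PyTorch sketch using torchvision's off-the-shelf DeepLabV3 with a MobileNetV3 backbone. The five-class setup (background plus the four diseases discussed later) is an illustrative assumption, and the model is untrained, so the output is only structural:

```python
import torch
from torchvision.models.segmentation import deeplabv3_mobilenet_v3_large

# Untrained model; 5 classes is an illustrative assumption:
# background plus spalling, exposed rebar, efflorescence, crack.
model = deeplabv3_mobilenet_v3_large(weights=None, num_classes=5).eval()

image = torch.rand(1, 3, 512, 512)        # one RGB image, values in [0, 1]
with torch.no_grad():
    logits = model(image)["out"]          # (1, 5, 512, 512) per-pixel class scores
pred = logits.argmax(dim=1)               # (1, 512, 512): one class label per pixel
print(pred.shape, pred.unique())          # every pixel now carries a category
```

The argmax over the class dimension is what turns raw network scores into the pixel-level segmentation mask described above.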
While these semantic segmentation methods showed promise, they still faced challenges. Issues like loss of local detail, a large number of parameters (making them computationally heavy), and slow inference speeds limited their real-world applicability, especially on edge devices like drones with limited processing power. Furthermore, many algorithms focused on identifying only a single type of disease, whereas in reality, bridges often suffer from multiple issues simultaneously, and these different diseases with similar visual characteristics can interfere with each other, making accurate identification even harder.
To address these critical challenges, a team of researchers proposed a lightweight semantic segmentation method for concrete bridge surface diseases based on an improved DeeplabV3+ architecture. Their goal was to create an AI model that was not only highly accurate in identifying multiple types of bridge diseases but also computationally efficient enough for real-time deployment on practical inspection tools.
Let’s delve into the key innovations they introduced:
- A Lighter and Faster Brain: The Improved MobileNetV3 Backbone
The foundation of any deep learning model for image analysis is its backbone network, responsible for extracting meaningful features from the input image. The original DeeplabV3+ used a powerful but computationally intensive backbone called Xception. While Xception is accurate, its large number of parameters makes the model slow and requires significant computational resources. This isn’t ideal for deployment on resource-constrained devices.
Therefore, the researchers replaced Xception with a lightweight backbone network called MobileNetV3. MobileNetV3 is specifically designed for mobile and embedded devices, employing clever techniques to reduce computational complexity without sacrificing too much accuracy. It utilizes the following building blocks (combined in the code sketch after this list):
- Depthwise separable convolutions: These convolutions break down the standard convolution operation into two lighter steps, significantly reducing the number of calculations. Think of it like separating the “what” and the “where” of the features being learned.
- Inverted residual structures with linear bottlenecks: These structures help to efficiently learn and represent features using fewer parameters.
- h-swish activation function: Activation functions introduce non-linearity into the network, allowing it to learn complex relationships. While the swish activation function can improve accuracy, it is computationally expensive. MobileNetV3 instead uses h-swish, defined as h-swish(x) = x · ReLU6(x + 3)/6, which approximates swish with cheap piecewise-linear operations.
- SENet (Squeeze-and-Excitation Network) attention module: This module helps the network focus on the most important features by learning to weigh different channels in the feature maps.
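Here is a minimal PyTorch sketch showing how these four pieces fit together in one MobileNetV3-style inverted residual block. It is a simplified illustration rather than the official implementation; the channel counts, SE reduction ratio, and placement of activations are assumptions made for clarity:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE attention: squeeze with global pooling, excite with two 1x1 convs."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))  # reweight channels by importance

class InvertedResidual(nn.Module):
    """Expand (1x1) -> depthwise 3x3 -> SE -> project (1x1, linear bottleneck)."""
    def __init__(self, c_in, c_out, expand=4, stride=1):
        super().__init__()
        c_mid = c_in * expand
        self.use_skip = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),   # pointwise expansion
            nn.BatchNorm2d(c_mid),
            nn.Hardswish(),                          # h-swish activation
            nn.Conv2d(c_mid, c_mid, 3, stride, 1,    # depthwise: one filter per
                      groups=c_mid, bias=False),     # channel (the "where")
            nn.BatchNorm2d(c_mid),
            nn.Hardswish(),
            SqueezeExcite(c_mid),                    # channel attention
            nn.Conv2d(c_mid, c_out, 1, bias=False),  # pointwise mixing (the "what")
            nn.BatchNorm2d(c_out),                   # no activation: linear bottleneck
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y         # inverted residual skip

x = torch.randn(1, 16, 64, 64)
print(InvertedResidual(16, 16)(x).shape)  # torch.Size([1, 16, 64, 64])
```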
However, the researchers didn’t just use the standard MobileNetV3. They made further improvements to tailor it for bridge disease detection:
- Replacing the SENet module with ECA-Net (Efficient Channel Attention Network): ECA-Net is a more efficient variant of SENet. It avoids the dimensionality reduction in SENet's bottleneck, which can discard channel information and degrade the predicted attention weights, and it enables local cross-channel information exchange with a single lightweight 1-D convolution (see the sketch after this list).
- Truncating time-consuming layers: By removing some of the more computationally expensive layers in the original MobileNetV3, they further reduced the model’s parameter size and accelerated training and prediction speeds.
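For intuition, here is a minimal PyTorch sketch of an ECA-style module following the published ECA-Net recipe: global average pooling followed by a single 1-D convolution across channels, with an odd kernel size derived adaptively from the channel count (gamma = 2, b = 1). It is illustrative, not the authors' exact code:

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: no dimensionality reduction, just a 1-D
    convolution across channels whose kernel grows with the channel count."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1                  # kernel size must be odd
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = self.pool(x)                           # (N, C, 1, 1): squeeze
        y = y.squeeze(-1).transpose(1, 2)          # (N, 1, C): channels as a sequence
        y = self.conv(y)                           # local cross-channel interaction
        y = torch.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y                               # reweight the channels

x = torch.randn(2, 64, 32, 32)
print(ECA(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```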
The result of these changes was a significantly lighter and faster backbone that could still extract the features needed for accurate disease identification. Within the DeeplabV3+ framework, the improved MobileNetV3 achieved the lowest parameter count (6.89 × 10^6) and the highest throughput (54.52 frames per second) compared to backbones like Xception, EfficientNetV2, and ResNet-101. Think of it like swapping a powerful but gas-guzzling car engine for a more fuel-efficient yet still capable one: you get good performance with much less resource consumption.
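Parameter counts and FPS figures like these are straightforward to reproduce. Below is a generic sketch (the helper names are mine, not the authors' benchmarking code) for counting learnable parameters and timing forward passes; the input resolution, run count, and warm-up length are arbitrary assumptions:

```python
import time
import torch

def count_params(model):
    """Total learnable parameters; the improved backbone reports ~6.89 million."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def measure_fps(model, size=(1, 3, 512, 512), runs=50, device="cpu"):
    """Average forward passes per second after a short warm-up."""
    model = model.eval().to(device)
    x = torch.randn(size, device=device)
    for _ in range(5):                      # warm-up (e.g. CUDA initialization)
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)
```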
- Seeing Details at Every Scale: The CSF-ASPP Module
After the backbone extracts the initial features, the Atrous Spatial Pyramid Pooling (ASPP) module in DeeplabV3+ plays a crucial role in capturing contextual information at different scales. The original ASPP uses parallel dilated convolutions with different dilation rates to expand the receptive field (the area of the input image that a neuron in the network “sees”). This is important because bridge diseases can vary significantly in size. A large crack needs a large receptive field to be understood as a continuous entity, while small areas of spalling require a finer-grained view.
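In code, the original ASPP is essentially a set of parallel dilated convolutions plus an image-level pooling branch, concatenated and fused. This simplified PyTorch sketch (batch norms and activations omitted for brevity) shows that structure with the standard (6, 12, 18) rates:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Simplified DeeplabV3+-style ASPP: parallel dilated branches plus
    image-level pooling, concatenated and fused by a 1x1 convolution."""
    def __init__(self, c_in, c_out=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(c_in, c_out, 1)])  # 1x1 branch
        for r in rates:  # same 3x3 kernel, increasingly wide receptive field
            self.branches.append(nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r))
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(c_in, c_out, 1))
        self.project = nn.Conv2d(c_out * (len(rates) + 2), c_out, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

x = torch.randn(1, 160, 32, 32)
print(ASPP(160)(x).shape)  # torch.Size([1, 256, 32, 32])
```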
However, the original ASPP had limitations:
- Its larger dilation rates could lead to the loss of detailed and spatial information, especially for small area diseases, which are common on bridge surfaces.
- Interference between multiple diseases could hinder the network’s ability to effectively mine contextual information in the deeper layers.
To overcome these limitations, the researchers designed a Cross Scale Fusion ASPP (CSF-ASPP) module with several key improvements (a schematic sketch follows this list):
- Cascading branches for cross-scale fusion: Inspired by the DenseNet architecture, they redesigned the relationship between the different branches in the ASPP module. Instead of just processing in parallel, the output of one dilated convolution branch is concatenated with the features from the backbone network and then fed into the next branch. This creates a cascading effect, allowing for a richer fusion of multi-scale features and enhancing the complementarity between them, leading to improved anti-interference ability when multiple diseases are present. Imagine different specialists in a team sharing their findings sequentially, building upon each other’s insights for a more comprehensive understanding.
- Adding a convolution branch and modifying dilation rates: They added an extra dilated convolution branch and changed the dilation rates of the original branches from (6, 12, 18) to (4, 8, 12, 16). This provides the module with a wider range of convolutional kernels of different scales, enabling it to extract more features relevant to diseases of varying sizes. It’s like having different magnifying glasses to examine both broad patterns and fine details.
- Replacing traditional convolutions with Depthwise Over-parameterized Convolutions (DO-Convs): DO-Conv augments a standard convolutional layer with an extra depthwise convolution during training, improving the model's feature representation ability. Because the two kernels can be folded back into a single traditional convolution after training, inference cost does not increase. Essentially, it is a more sophisticated training technique that yields better feature learning without slowing down real-time detection.
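A schematic PyTorch sketch of the cascading idea follows. This is a hedged reading of the description above, not the authors' exact architecture: each dilated branch consumes the backbone features concatenated with the previous branch's output, the rates are (4, 8, 12, 16), and plain convolutions stand in for the DO-Convs (which are equivalent to ordinary convolutions at inference anyway):

```python
import torch
import torch.nn as nn

class CascadedASPP(nn.Module):
    """Cross-scale fusion sketch: DenseNet-style cascading between branches.
    Plain Conv2d stands in for DO-Conv (which folds into a normal convolution
    after training); batch norms and the image-pooling branch are omitted."""
    def __init__(self, c_in, c_out=256, rates=(4, 8, 12, 16)):
        super().__init__()
        self.branches = nn.ModuleList()
        c_prev = 0
        for r in rates:
            # Each branch sees the backbone features plus the previous output.
            self.branches.append(
                nn.Conv2d(c_in + c_prev, c_out, 3, padding=r, dilation=r))
            c_prev = c_out
        self.project = nn.Conv2d(c_out * len(rates), c_out, 1)

    def forward(self, x):
        outs, prev = [], None
        for conv in self.branches:
            inp = x if prev is None else torch.cat([x, prev], dim=1)
            prev = conv(inp)
            outs.append(prev)
        return self.project(torch.cat(outs, dim=1))

x = torch.randn(1, 160, 32, 32)
print(CascadedASPP(160)(x).shape)  # torch.Size([1, 256, 32, 32])
```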
Experimental results showed that the CSF-ASPP module significantly outperformed the original ASPP in terms of both mean Intersection over Union (mIoU) and mean Pixel Accuracy (mPA), demonstrating its superior ability in semantic segmentation of concrete bridge surface diseases.
- Focusing on the Real Problems: The Focal Loss Function
The final piece of the puzzle was addressing the issue of sample imbalance in the bridge disease dataset. In real-world bridge images, the area covered by diseases is typically much smaller than the area of healthy concrete. This means that during training, the AI model sees many more “normal” pixels than “disease” pixels. As a result, it can become biased towards correctly classifying the majority class (normal concrete) while neglecting the accurate identification of the minority classes (the actual diseases). Using a standard cross-entropy loss function (which measures the difference between the model’s predictions and the ground truth) doesn’t effectively address this imbalance.
To tackle this, the researchers adopted the focal loss function. Focal loss introduces a modulating factor that reduces the loss contribution from easy-to-classify samples (the abundant normal-concrete pixels) and focuses training on hard and minority-class samples (the disease pixels). Formally, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class, the focusing parameter γ (gamma) controls how strongly easy samples are down-weighted, and the weighting factor α (alpha) rebalances the classes.
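A compact PyTorch implementation of this idea for segmentation might look like the following. It is a generic sketch; the paper's exact gamma and alpha settings are not reproduced here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Multi-class focal loss for segmentation.
    logits: (N, C, H, W) raw scores; target: (N, H, W) integer class labels;
    alpha: optional per-class weights of shape (C,)."""
    ce = F.cross_entropy(logits, target, reduction="none")  # per-pixel CE
    pt = torch.exp(-ce)                  # probability assigned to the true class
    loss = (1 - pt) ** gamma * ce        # modulating factor shrinks easy pixels
    if alpha is not None:
        loss = alpha.to(logits.device)[target] * loss  # class rebalancing
    return loss.mean()

logits = torch.randn(2, 5, 64, 64)       # 5 classes: background + 4 diseases
target = torch.randint(0, 5, (2, 64, 64))
print(focal_loss(logits, target).item())
```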
By giving more importance to the misclassified or underrepresented disease samples, the focal loss function helps the model learn to better identify these critical areas, even when they are rare in the training data. This makes the model more robust and adaptable in complex real-world scenarios where diseases might be small or have subtle appearances. Ablation experiments confirmed that using the focal loss function led to a further increase in the model’s mIoU and mPA.
To validate the effectiveness of their improved DeeplabV3+ model, the researchers conducted extensive experiments on a real-world dataset of concrete bridge surface disease images captured by a traffic inspection company. The dataset covered four common diseases: spalling, exposed rebar, efflorescence, and cracks. To improve robustness, it included images taken under different lighting conditions and shooting angles, and data augmentation techniques such as flipping, rotation, and color space transformations were applied to increase its size and diversity.
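For segmentation data, geometric augmentations must be applied identically to the image and its label mask, while color transforms should touch only the image. Here is one way to express that pairing with torchvision's v2 transforms; the specific operations and parameters are illustrative, not the paper's exact pipeline:

```python
import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

# Geometric transforms are applied jointly to image and mask; ColorJitter
# automatically skips the Mask. Parameters here are illustrative.
augment = v2.Compose([
    v2.RandomHorizontalFlip(p=0.5),
    v2.RandomVerticalFlip(p=0.5),
    v2.RandomRotation(degrees=15),
    v2.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
])

image = tv_tensors.Image(torch.rand(3, 512, 512))          # dummy RGB image
mask = tv_tensors.Mask(torch.randint(0, 5, (512, 512)))    # per-pixel labels
aug_image, aug_mask = augment(image, mask)                 # transformed pair
```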
The experiments were performed on powerful hardware with an NVIDIA RTX 4080 GPU. Performance was evaluated using standard semantic segmentation metrics: Intersection over Union (IoU) per class, mean IoU (mIoU) across all classes, Pixel Accuracy (PA) per class, and mean Pixel Accuracy (mPA) across all classes. The researchers also reported the model's parameter count (indicating its size and computational complexity) and its inference speed in Frames Per Second (FPS).
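All of these accuracy metrics fall out of a single pixel-level confusion matrix. Here is a minimal NumPy sketch, defining per-class PA as the recall of that class (a common convention; the paper may compute it slightly differently):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Per-class IoU and PA from a pixel confusion matrix (rows: ground truth,
    columns: prediction); mIoU and mPA are their means over classes."""
    cm = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    iou = tp / (cm.sum(0) + cm.sum(1) - tp + 1e-10)  # TP / (TP + FP + FN)
    pa = tp / (cm.sum(1) + 1e-10)                    # TP / all pixels of the class
    return iou, iou.mean(), pa, pa.mean()

gt = np.random.randint(0, 5, (64, 64))
pred = np.random.randint(0, 5, (64, 64))
iou, miou, pa, mpa = segmentation_metrics(pred, gt, num_classes=5)
print(f"mIoU={miou:.4f}  mPA={mpa:.4f}")
```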
The results were compelling:
- The improved DeeplabV3+ achieved an mIoU of 75.24% and an mPA of 84.68%, significantly outperforming the baseline DeeplabV3+ and other state-of-the-art semantic segmentation models like U-Net, HRNet, PSPNet, SETR, SegFormer, Mask2Former, and PIDNet. This means their model was more accurate in identifying and segmenting the different types of bridge diseases.
- The improved model showed a 90.33% reduction in the number of parameters compared to the original DeeplabV3+, shrinking its size dramatically. This makes it much more suitable for deployment on devices with limited computational resources.
- It also achieved a 36.22% improvement in FPS, reaching 52.64 FPS. This higher inference speed is crucial for real-time applications, such as when inspecting bridges using drones. Imagine a drone flying along a bridge, with the AI model instantly analyzing the video feed for signs of damage. A higher FPS allows for faster and more efficient inspection coverage.
Visual comparisons of the segmentation results further confirmed the superiority of the improved DeeplabV3+. It produced segmentation masks that were closer to the ground truth labels, with better edge detail detection and clearer segmentation outlines compared to other models.
While this study demonstrates remarkable progress, the researchers also acknowledge certain limitations. The dataset used primarily came from a single traffic inspection company and was concentrated in temperate monsoon climate regions. This means the model’s generalization ability to bridges in significantly different environments (like extreme cold or tropical zones) might be limited, as these environments can lead to different types of damage (e.g., freeze-thaw cracking, accelerated corrosion).
To address this, future research plans include:
- Collaborating with more bridge inspection agencies to collect a more diverse and representative dataset from different regions and bridges.
- Implementing an active learning framework to strategically select the most informative images for labeling, further improving the model’s learning efficiency and generalization.
- Deploying the model on concrete bridge inspection drones and implementing a multi-scale inference strategy to balance speed and accuracy based on the density of detected diseases.
- Leveraging drone metadata and Structure-from-Motion (SfM) techniques to achieve precise spatial calibration and convert the segmentation masks into quantitative parameters like crack width and spalling area, providing even more valuable insights for bridge maintenance decisions.
In conclusion, the development of this improved lightweight semantic segmentation method marks a significant leap towards a future where AI plays a crucial role in safeguarding our bridges. By combining innovations in backbone networks, feature extraction modules, and loss functions, this research has delivered a model that is not only highly accurate but also practical for real-world deployment. As data collection expands and the technology continues to evolve, we can expect AI-powered bridge inspection to become an indispensable tool for ensuring the safety and longevity of these vital structures, allowing us to drive over them with even greater peace of mind.
References
Yu, Z., Dai, C., Zeng, X. et al. A lightweight semantic segmentation method for concrete bridge surface diseases based on improved DeeplabV3+. Sci Rep 15, 10348 (2025). https://doi.org/10.1038/s41598-025-95518-5
Jiang, S. & Zhang, J. Real-time crack assessment using deep neural networks with wall-climbing unmanned aerial system. Comput.-Aided Civ. Infrastruct. Eng. 35(6), 549–564 (2020).
Lin, T. Y. et al. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 936–944 (2017).
Rubio, J. J. et al. Multi-class structural damage segmentation using fully convolutional networks. Comput. Ind. 112, 103121 (2019).
Jia, X. J., Wang, Y. X. & Wang, Z. Fatigue crack detection based on semantic segmentation using DeepLabV3+ for steel girder bridges. Appl. Sci. 14(18), 8132 (2024).

