Background-Aware Pooling and Noise-Aware Loss
for Weakly-Supervised Semantic Segmentation

(CVPR 2021)



Yonsei University

[Teaser figure: qualitative results; columns show Input image, Ground truth, Ours, Ours*, GrabCut, MCG, WSSL, and SDI.]
Visual comparison of pseudo ground-truth labels. Our approach generates better segmentation labels than other WSSS methods using object bounding boxes (WSSL and SDI). Hand-crafted methods (GrabCut and MCG) fail to segment object boundaries. For MCG, we compute intersection-over-union (IoU) scores using pairs of segment proposals and bounding boxes, and choose the best one for each box. Ours*: Ours with an indication of unreliable regions.

Abstract

We address the problem of weakly-supervised semantic segmentation (WSSS) using bounding box annotations. Although object bounding boxes are good indicators for segmenting the corresponding objects, they do not specify object boundaries, making it hard to train convolutional neural networks (CNNs) for semantic segmentation. We find that background regions are perceptually consistent in part within an image, and this can be leveraged to discriminate foreground and background regions inside object bounding boxes. To implement this idea, we propose a novel pooling method, dubbed background-aware pooling (BAP), that focuses more on aggregating foreground features inside the bounding boxes using attention maps. This allows us to extract high-quality pseudo segmentation labels for training CNNs for semantic segmentation, but the labels still contain noise, especially at object boundaries. To address this problem, we also introduce a noise-aware loss (NAL) that makes the networks less susceptible to incorrect labels. Experimental results demonstrate that learning with our pseudo labels alone already outperforms state-of-the-art weakly- and semi-supervised methods on the PASCAL VOC 2012 dataset, and the NAL boosts the performance further.

Method overview

Figure 1: Overview of image classification using BAP. We first extract queries from a feature map using a binary mask indicating the definite background. The queries are then used to compute an attention map describing the likelihood that each pixel belongs to the background. The attention map enables localizing entire foreground regions, leading to better foreground features. Finally, we apply a softmax classifier to the foreground features for each bounding box together with the queries. The entire network is trained with a cross-entropy loss.
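To make the pooling step concrete, here is a minimal PyTorch sketch of BAP. The grid-based query extraction, the cosine-similarity attention, and all shapes and names (background_queries, bap_foreground_feature, num_grid, and so on) are our illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of background-aware pooling (BAP); shapes and the grid
# scheme are assumptions for illustration.
import torch
import torch.nn.functional as F

def background_queries(feats, bg_mask, num_grid=4):
    """Average features over definite-background cells of a coarse grid.

    feats:   (C, H, W) backbone feature map.
    bg_mask: (H, W) float mask, 1 = definite background (outside all boxes).
    Returns a (Q, C) tensor of background queries; assumes at least one
    grid cell contains background pixels.
    """
    C, H, W = feats.shape
    queries = []
    for gy in range(num_grid):
        for gx in range(num_grid):
            ys = slice(gy * H // num_grid, (gy + 1) * H // num_grid)
            xs = slice(gx * W // num_grid, (gx + 1) * W // num_grid)
            m = bg_mask[ys, xs]
            if m.sum() > 0:  # skip cells with no background pixels
                q = (feats[:, ys, xs] * m).sum(dim=(1, 2)) / m.sum()
                queries.append(q)
    return torch.stack(queries)  # (Q, C)

def background_attention(feats, queries):
    """Cosine similarity between each pixel feature and the background
    queries, rectified and averaged over queries; values in [0, 1]."""
    C, H, W = feats.shape
    f = F.normalize(feats.flatten(1), dim=0)   # (C, H*W), unit channel norm
    q = F.normalize(queries, dim=1)            # (Q, C)
    sim = (q @ f).relu().mean(dim=0)           # (H*W,)
    return sim.view(H, W)

def bap_foreground_feature(feats, box, bg_attn):
    """Aggregate features inside a box, weighted by foreground likelihood.

    box: (x1, y1, x2, y2) in feature-map coordinates.
    bg_attn: (H, W) background attention map.
    """
    x1, y1, x2, y2 = box
    w_fg = 1.0 - bg_attn[y1:y2, x1:x2]         # foreground weights
    region = feats[:, y1:y2, x1:x2]
    return (region * w_fg).sum(dim=(1, 2)) / w_fg.sum().clamp(min=1e-6)
```

In this sketch, the per-box pooled features would then go through a softmax classifier trained with cross-entropy, as the caption above describes.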

Figure 2: Generating pseudo labels. We compute and using a background attention map and CAMs, respectively, which are used as a unary term for DenseCRF to obtain pseudo segmentation labels . We extract prototypical features for each class using the labels , and use them as queries to retrieve high-level features from the feature map , from which we obtain additional pseudo labels .
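A rough sketch of the label-generation step is given below, assuming the pydensecrf package for DenseCRF inference. How the unary term is assembled from the background attention map and the CAMs, and the confidence threshold tau used to mark unreliable regions, are illustrative assumptions; the prototype-based retrieval of additional labels is omitted for brevity.

```python
# Sketch of pseudo-label generation with DenseCRF refinement.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

IGNORE = 255  # marker for unreliable pixels (the "Ours*" labels)

def generate_pseudo_labels(img, bg_attn, cams, n_iters=10, tau=0.5):
    """img: (H, W, 3) uint8 image, bg_attn: (H, W) background attention,
    cams: (K, H, W) CAMs for the K classes present in the image."""
    H, W = bg_attn.shape
    # Assemble a (K+1)-way per-pixel distribution: background score from
    # the attention map, foreground scores from the CAMs.
    probs = np.concatenate([bg_attn[None], cams], axis=0).astype(np.float32)
    probs /= probs.sum(axis=0, keepdims=True).clip(min=1e-6)

    # DenseCRF with the usual Gaussian and bilateral pairwise terms.
    crf = dcrf.DenseCRF2D(W, H, probs.shape[0])
    crf.setUnaryEnergy(unary_from_softmax(probs))
    crf.addPairwiseGaussian(sxy=3, compat=3)
    crf.addPairwiseBilateral(sxy=80, srgb=13,
                             rgbim=np.ascontiguousarray(img), compat=10)
    q = np.array(crf.inference(n_iters)).reshape(-1, H, W)

    labels = q.argmax(axis=0).astype(np.uint8)
    labels[q.max(axis=0) < tau] = IGNORE  # low confidence -> unreliable
    return labels
```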

Our approach mainly consists of three stages. First, we train a CNN for image classification using object bounding boxes (Fig. 1). We use BAP, which leverages a background prior, namely that background regions are perceptually consistent in part within an image, to extract more accurate CAMs. To this end, we compute a background attention map adaptively for each image. Second, we generate pseudo segmentation labels using the CAMs obtained from the classification network together with the background attention maps and prototypical features (Fig. 2). Finally, we train CNNs for semantic segmentation with the pseudo ground truth, whose labels may be noisy. We use the NAL to lessen the influence of these noisy labels; a sketch of the loss follows below.
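The sketch below shows one plausible form of a noise-aware loss, not necessarily the paper's exact formulation. It assumes pseudo labels whose unreliable pixels carry an ignore index, a second (noisier) dense label map covering those pixels, and a per-pixel confidence map, e.g., from correlations with the class prototypes; the names and the weighting scheme are our assumptions.

```python
# Minimal sketch of a noise-aware loss (illustrative, not the paper's
# exact formulation).
import torch
import torch.nn.functional as F

IGNORE = 255

def noise_aware_loss(logits, labels, noisy_labels, confidence, lam=1.0):
    """logits: (B, K, H, W) network outputs; labels: (B, H, W) pseudo labels
    with IGNORE at unreliable pixels; noisy_labels: (B, H, W) dense labels
    covering the unreliable pixels; confidence: (B, H, W) in [0, 1]."""
    # Plain cross-entropy on the reliable pixels.
    loss_reliable = F.cross_entropy(logits, labels, ignore_index=IGNORE)

    # Confidence-weighted cross-entropy on the unreliable pixels: pixels
    # whose (possibly wrong) labels have low confidence contribute little.
    per_pixel = F.cross_entropy(logits, noisy_labels, reduction="none")
    mask = (labels == IGNORE).float()
    loss_unreliable = (confidence * per_pixel * mask).sum() \
        / mask.sum().clamp(min=1.0)

    return loss_reliable + lam * loss_unreliable
```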

Experimental results

Table 1: Comparison of pseudo labels on the PASCAL VOC 2012 train and val sets in terms of mIoU. Numbers in bold indicate the best performance. We report the supervision types with the number of annotations. For MCG, we manually choose the segment proposal that gives the highest IoU score with each bounding box. ∗: pseudo labels contain unreliable regions.
Table 2: Quantitative comparison with state-of-the-art methods using DeepLab-V1 (VGG-16) on the PASCAL VOC 2012 dataset in terms of mIoU. Numbers in bold indicate the best performance and underlined ones the second best.

Paper

Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation
Y. Oh, B. Kim, B. Ham
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
[arXiv] [GitHub] [BibTeX]

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2019R1A2C2084816).